Carbon footprint - genotoul-bioinfo

Being mindful of the environmental impact of our activity is crucial if we are to aim for a sustainable future. The field of big data keeps expanding, with the consequences this implies: higher energy consumption, growing data storage and a significant carbon footprint.

Information technology and computing clusters have a substantial environmental impact, primarily through their intensive consumption of primary resources. These systems rely heavily on rare and non-renewable resources such as rare earth metals, which are essential for manufacturing semiconductors and high-performance computing components. The extraction and processing of these resources can lead to habitat destruction, pollution, and ecosystem disruption. Additionally, the energy-intensive nature of data centers and clusters, required for the processing and storage of data, places significant demands on water resources for cooling purposes.

Our impact

We computed the carbon footprint of the Genotoul cluster acquired in 2023 and used in 2024. Here are the results.

The same analysis was carried out in 2019 on our previous infrastructure. Results can be found here.

Some advice

Here are some good practices to save time and reduce your carbon footprint.

Firstly, considering how urgent the situation is, making research stakeholders aware of these issues is key to achieving the sustainability goals that humanity needs to survive.

For extensive analyses, consider whether launching the analysis is truly necessary.

If you have a substantial number of similar processes (e.g., numerous fastq files to analyze):
- Launch the process on a single file to assess your memory, CPU, and time requirements.
- Use the ‘seff’ command to gather information about the completed job.
- Then launch the job array with the adjusted parameters.

If you are using a workflow manager:
- Review the default resources in the configuration and reduce them if necessary. You can also use the approach described above to fine-tune the values.
- Pause processing at quality check steps, and run the following steps only if the quality is sufficient.
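
The single-file test described above can be turned into a resource request with a little arithmetic. The following Python sketch is only an illustration: the function names, the 20% safety margin and the example measurements are hypothetical, and the figures you feed in should be the memory used, elapsed time and CPU efficiency that ‘seff’ reports for your own test job. The resulting strings use the standard sbatch options `--array`, `--cpus-per-task`, `--mem` and `--time`.

```python
import math


def suggest_request(peak_mem_gb, elapsed_min, cpu_efficiency, cpus_tested, margin=1.2):
    """Derive a padded resource request from one test job.

    peak_mem_gb, elapsed_min and cpu_efficiency (0..1) are the values
    observed for the test job; margin is an assumed safety factor.
    """
    return {
        # Pad memory and time so that slightly larger inputs still fit.
        "mem_gb": math.ceil(peak_mem_gb * margin),
        "time_min": math.ceil(elapsed_min * margin),
        # Keep only as many cores as the test job actually kept busy.
        "cpus": max(1, math.ceil(cpus_tested * cpu_efficiency)),
    }


def sbatch_options(request, n_files):
    """Format the request as sbatch options for a job array of n_files tasks."""
    return [
        f"--array=1-{n_files}",
        f"--cpus-per-task={request['cpus']}",
        f"--mem={request['mem_gb']}G",
        f"--time={request['time_min']}",  # a bare number is read as minutes
    ]


if __name__ == "__main__":
    # Hypothetical test job: 8 cores requested, 45% CPU efficiency,
    # 4.6 GB peak memory, 37 minutes of wall-clock time.
    request = suggest_request(4.6, 37, 0.45, 8)
    print(" ".join(sbatch_options(request, n_files=200)))
```

With these hypothetical numbers the request shrinks from 8 cores to 4, which is the kind of adjustment that saves both queue time and energy across a large array.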

——

The table below reproduces Table 1, “Carbon Footprint of a Range of Bioinformatic Tasks”, from the article “The Carbon Footprint of Bioinformatics”, published in Molecular Biology and Evolution, Volume 39, Issue 3, March 2022.

This article investigates the carbon footprint associated with bioinformatics, a field that utilizes computational tools to analyze and interpret biological data. The authors explore various stages of the bioinformatics process, including data collection, analysis, storage, and sharing.

The study assesses how these different stages contribute to greenhouse gas emissions and examines factors influencing the carbon footprint of bioinformatics. The authors also discuss the implications of their findings, emphasizing the importance of considering environmental impact in bioinformatics research.

In summary, the article highlights the need to be aware of the carbon footprint associated with bioinformatics and suggests avenues for reducing its environmental impact.

Task	Tool	Version	Details about the Experiments	Carbon Footprint Increase (%)	Carbon Footprint (kgCO₂e)	Tree-months	km in a Car (EU)	Running Time	Memory	Approximate Scaling (if known)
Genome scaffolding	SSPACE	2.0	Scaffolding 2.4 million long reads from human chromosome 14 (Hunt et al. 2014).	—	0.0010	0.0011	0.01	3 min 21 s	30 GB	Linearly with number of reads.
	SOAPdenovo2	r223		+45%	0.0015	0.0016	0.01	4 min 52 s	30 GB
	SGA	0.9.43		+2,752%	0.029	0.032	0.17	1 h 35 min	30 GB
Genome scaffolding	SSPACE	2.0	Scaffolding 23 million short reads from human chromosome 14 (Hunt et al. 2014).	—	0.0027	0.0029	0.02	8 min 40 s	30 GB
	SOAPdenovo2	r223		+34%	0.0036	0.0039	0.02	1 min 38 s	30 GB
	SGA	0.9.43		+4,801%	0.13	0.14	0.74	7 h 05 min	30 GB
Genome assembly	Abyss	2.0	De novo assembly of a human genome from Illumina sequencing reads (Jackman et al. 2017).	—	11	12	61	20 h	34 GB
	MEGAHIT	1.0.6		+42%	15	16	86	26 h	197 GB
Metagenome assembly	MetaVelvet k101	1.2.01	Metagenome assembly from 100 soil samples (Vollmers et al. 2017).	—	14	16	82	1 h 06 min	130 GB
	MEGAHIT	1.0.3		+438%	77	84	439	15 h 36 min	12 GB
	metaSPAdes	3.8.0		+1,206%	186	203	1,065	29 h 24 min	60 GB
Metagenome classification (short read)	Kraken2	2.0.7	Metagenomic classification of 5 Gb of randomly sampled reads from Zymo mock community (batch ZRC190633), containing yeast, Gram-negative, and positive bacteria (Dilthey et al. 2019)	—	0.0052	0.0057	0.03	20 min	21 GB	Linearly with number of reads.
	Centrifuge	1.0.4		+141%	0.013	0.014	0.07	58 min	12 GB
	Kraken/Bracken	0.10.5/1.0.0		+1,650%	0.092	0.10	0.52	1 h 40 min	154 GB
Metagenome classification (long read)	MetaMaps	—		—	18.25	19.91	104.27	209 h 53 min	262 GB
Phylogenetics	BEAST/BEAGLE	1.8.4/2.1.2	Codon substitution modeling of extant carnivores and a pangolin group. Nucleotide substitution and phylogeographic modeling of Ebola virus genomes. See supplementary table 2, Supplementary Material online, for detailed results (Baele et al. 2019).	—	0.012–0.30	0.013–0.33	0.069–1.72	3 min 30 s to 7 h 45 min	2–8 GB	Power law with number of loci.
Phylogenetics	RAxml/ExaML, PhyML, IQ-TREE, FastTree	8.2.0/3.0.17, 20160530 1.4.2, 2.1.9	Over 670,000 tree inferences on about 45,000 single-gene alignments and supermatrices from 19 empirical phylogenomic data sets with thousands of genes and around 200 taxa. (Zhou et al. 2018)	—	3565	3889	20,371	300,000 h	8 GB
Phylogenetics	ExaML	—	A 322-million-bp MULTIZ alignment of putatively orthologous genome regions across all species, comprising approximately 30% of an average assembled avian genome. This corresponded to the maximal orthologous sequence obtainable across all orders of Neoaves. (Jarvis et al. 2014)	—	4372	4769	24,983	367,920 h	8 GB
RNA read alignment	HISAT2	2.0.0beta	Alignment of 10 million 100-base read pairs to Homo Sapiens hg19 genome (Baruzzo et al. 2017).	—	0.0054	0.0059	0.031	1 min 48 s	5 GB	Linearly with number of reads.
	STAR	2.5.0a		+78%	0.0097	0.011	0.055	6 min 01 s	35 GB
	TopHat2	2.1.0		+5,756%	0.32	0.35	1.81	2 h 14 min	16 GB
	Novoalign	3.02.13		+17,926%	0.98	1.07	5.58	32 h 12 min	64 GB
RNA read alignment	HISAT2	2.0.0beta	Alignment of 10 million 100-base read pairs to Plasmodium falciparum genome (Baruzzo et al. 2017).	—	0.0052	0.0057	0.030	1 min 44 s	1 GB
	TopHat2	2.1.0		+4,519%	0.24	0.26	1.37	1 h 25 min	13 GB
	STAR	2.5.0a		+7,025%	0.37	0.40	2.11	2 h 27 min	8 GB
	Novoalign	3.02.13		+12,847%	0.67	0.73	3.83	38 h 04 min	21 GB
RNA-seq QC pipeline	FastQC, TrimGalore, bbmap/clumpify, and STAR	-/v0.6.0/-/v2.7.0e	Quality control analysis of raw reads quality of 392 samples from the Childhood Asthma Study (in-house).	—	54.97	59.97	314.11	485 h 12 min	8 GB
Transcript isoform abundance estimation	Sailfish 1 core	0.6.3	Transcript isoform quantification of 100 million in silico reads generated from Flux Simulator with hg19 genome and GENCODE v19 annotation set (Kanitz et al. 2015)	—	0.0081	0.0088	0.046	42 min	7 GB	Linearly with the number of reads.
	Sailfish 16 cores			+344%	0.036	0.039	0.21	14 min	7 GB
	Cufflinks 1 core	2.1.1		+451%	0.045	0.049	0.26	3 h 30 min	11 GB
	Cufflinks 16 cores			+3,262%	0.27	0.30	1.56	1 h 45 min	12 GB
	RSEM 1 core	1.2.18		+6,982%	0.57	0.63	3.28	47 h 10 min	9 GB
	RSEM 16 cores			+17,162%	1.40	1.53	8.00	8 h 50 min	21 GB
GWAS	Bolt-LMM	2.3	Analyses of a single trait in UK Biobank (N = 500,000) (Loh et al. 2018)	—	4.70	5.13	26.87	60 h 58 min	100 GB	Linearly with number of variants.
	Bolt-LMM	1.0		+268%	17.29	18.86	98.81	224 h 10 min	100 GB
Cohort scale eQTL analysis	TensorQTL	1.0.2	Cis-eQTL mapping of 10.7 M SNPs against 18,373 genetic features in a cohort of 2,745 individuals (in-house).	—	2.04	2.22	11.7	1 h 14 min	192 GB	Nonlinearly with the number of traits or the sample size.
	LIMIX	2.0.3		+9,256%	190.73	208.07	1,089.9	9,705 h	41–221 GB
Single cis-eQTL gene mapping	TensorQTL	—	Cis-eQTL mapping one gene from skeletal muscle in GTEx (v6p) (Taylor-Weiner et al. 2019).	—	0.00001	0.00001	0.00004	0.11 s	52 GB
	FastQTL	—		+2,681%	0.0002	0.0002	0.001	30 s	52 GB
Molecular dynamics simulation	AMBER	18	Simulation of a Satellite Tobacco Mosaic Virus with 1,066,628 atoms for 100 ns^a (NAMD Performance n.d.; The Pmemd.Cuda GPU Implementation n.d.).	—	18	19	102	75 h	(^b)
	NAMD	2.13		+433%	95	104	544	400 h	(^b)
Molecular docking	Glide	57111	Molecular docking of four DUD systems, scaled to 1 m ligands (Ruiz-Carmona et al. 2014)	—	13	14	74	1,027 h 47 min	0.05 GB
	rDock	—		+1,092%	154	168	878	12,250 h	0.05 GB
	AutoDock Vina	—		+3,886%	514	561	2,938	40,972 h	0.05 GB
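
The “Tree-months” and “km in a Car (EU)” columns are conversions of the kgCO₂e column. The rows are consistent with a tree sequestering roughly 11 kg CO₂e per year and a European car emitting roughly 175 g CO₂e per km. Both constants are inferred here from the ratios in the table rather than stated on this page, so treat them as assumptions. A minimal Python sketch of the conversion, with hypothetical function names:

```python
# Assumed conversion factors, inferred from the ratios between table columns.
KG_CO2E_PER_TREE_MONTH = 11.0 / 12.0  # about 11 kg CO2e sequestered per tree per year
KG_CO2E_PER_KM_EU_CAR = 0.175         # about 175 g CO2e emitted per km


def tree_months(kg_co2e):
    """Months a single tree would need to sequester kg_co2e."""
    return kg_co2e / KG_CO2E_PER_TREE_MONTH


def car_km(kg_co2e):
    """Distance driven in an average EU car that emits kg_co2e."""
    return kg_co2e / KG_CO2E_PER_KM_EU_CAR


if __name__ == "__main__":
    # RNA-seq QC pipeline row of the table: 54.97 kgCO2e.
    kg = 54.97
    print(f"{tree_months(kg):.2f} tree-months, {car_km(kg):.2f} km")
```

Applied to the RNA-seq QC pipeline row, these factors give values that agree with the table's 59.97 tree-months and 314.11 km after rounding.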

In “Green Algorithms: Quantifying the Carbon Footprint of Computation” (2021), Loïc Lannelongue et al. propose a carbon footprint calculator: https://calculator.green-algorithms.org/.

The processors in our cluster are not in the calculator's list. To estimate the footprint of your jobs on our cluster, select “other” and set the TDP (Thermal Design Power) per core to 3.5 W. In addition, the PUE (Power Usage Effectiveness) of the datacentre hosting our machines is 1.4.
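
To get a feel for how these two parameters enter the estimate, here is a minimal Python sketch of the kind of formula described in the Green Algorithms paper: energy is runtime multiplied by the power drawn by cores and memory, scaled by the PUE, and the footprint is energy multiplied by the carbon intensity of the electricity. The TDP per core (3.5 W) and PUE (1.4) are the values given above. The memory power draw of 0.3725 W per GB is taken from our reading of that paper, and the carbon intensity in the example is a placeholder, so check both against the calculator before relying on the numbers. The function name is hypothetical.

```python
def estimate_footprint(runtime_h, n_cores, memory_gb, carbon_intensity_g_per_kwh,
                       core_usage=1.0, tdp_per_core_w=3.5, pue=1.4,
                       mem_power_w_per_gb=0.3725):
    """Return (energy in kWh, footprint in gCO2e) for one job.

    tdp_per_core_w and pue default to the values given for our cluster;
    mem_power_w_per_gb is an assumed constant, and the carbon intensity
    must be supplied by the caller for the relevant electricity mix.
    """
    # Power drawn by the cores actually used, plus the memory allocated.
    power_w = n_cores * tdp_per_core_w * core_usage + memory_gb * mem_power_w_per_gb
    # The PUE accounts for datacentre overhead such as cooling.
    energy_kwh = runtime_h * power_w * pue / 1000.0
    return energy_kwh, energy_kwh * carbon_intensity_g_per_kwh


if __name__ == "__main__":
    # Hypothetical job: 10 h on 8 cores with 32 GB of memory,
    # with a placeholder carbon intensity of 50 gCO2e/kWh.
    energy, carbon = estimate_footprint(10, 8, 32, carbon_intensity_g_per_kwh=50)
    print(f"{energy:.3f} kWh, {carbon:.1f} gCO2e")
```

Because memory is counted per GB allocated rather than per GB used, requesting less memory lowers the estimate even when the runtime is unchanged.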