Practices to reduce your carbon footprint

Being mindful of the environmental impact of our activities is crucial for a sustainable future. The field of big data keeps expanding, and with it the associated consequences: energy consumption, data storage demands, and a significant carbon footprint.

Information technology and computing clusters have a substantial environmental impact, primarily through their intensive consumption of primary resources. These systems rely heavily on rare and non-renewable resources such as rare earth metals, which are essential for manufacturing semiconductors and high-performance computing components. The extraction and processing of these resources can lead to habitat destruction, pollution, and ecosystem disruption. Additionally, data centers and clusters are energy-intensive, and the processing and storage of data also place significant demands on water resources for cooling.

Considering how urgent the situation is, making research stakeholders aware of these issues is key to achieving the sustainability goals humanity needs to reach.

Here are some good practices to save time and reduce your carbon footprint.

  1. Before launching an extensive analysis, consider whether it is truly necessary.

  2. If you have a substantial number of similar processes (e.g., numerous FASTQ files to analyze):

    • launch the process on a single file to assess your memory, CPU, and time requirements;

    • use the ‘seff’ command to gather information about the completed job;

    • then launch the job array with the adjusted parameters.

  3. If you are using a workflow manager,
    • review the default resources in the configuration and reduce them if necessary. You can also employ the aforementioned approach to fine-tune the values.
    • Pause the workflow at quality-control steps, and continue to the next step only if the quality is sufficient.
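For step 2, a SLURM session might look like the sketch below. The script names, array size, and resource values are hypothetical and should be adapted to your own pipeline:

```shell
# 1. Run the pipeline on a single file first, with deliberately generous requests:
sbatch --cpus-per-task=4 --mem=16G --time=02:00:00 process_one.sh sample_001.fastq

# 2. Once the job has completed, inspect what it actually used
#    (CPU efficiency, peak memory, wall time):
seff <jobid>

# 3. Resubmit the whole batch as a job array, with requests trimmed
#    to what seff reported (plus a small safety margin):
sbatch --array=1-500 --cpus-per-task=4 --mem=6G --time=00:30:00 process_all.sh
```

Trimming the memory and time requests this way lets the scheduler pack jobs more densely and avoids reserving resources that sit idle.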

——

Below is Table 1, “Carbon Footprint of a Range of Bioinformatic Tasks,” from the article “The Carbon Footprint of Bioinformatics,” published in Molecular Biology and Evolution, Volume 39, Issue 3, March 2022:

This article investigates the carbon footprint associated with bioinformatics, a field that utilizes computational tools to analyze and interpret biological data. The authors explore various stages of the bioinformatics process, including data collection, analysis, storage, and sharing.

The study assesses how these different stages contribute to greenhouse gas emissions and examines factors influencing the carbon footprint of bioinformatics. The authors also discuss the implications of their findings, emphasizing the importance of considering environmental impact in bioinformatics research.

In summary, the article highlights the need to be aware of the carbon footprint associated with bioinformatics and suggests avenues for reducing its environmental impact.

 

| Task | Tool | Version | Details about the experiments | Increase (%) | Carbon footprint (kgCO2e) | Tree-months | km in a car (EU) | Running time | Memory | Approximate scaling (if known) |
|---|---|---|---|---|---|---|---|---|---|---|
| Genome scaffolding | SSPACE | 2.0 | Scaffolding 2.4 million long reads from human chromosome 14 (Hunt et al. 2014) | | 0.0010 | 0.0011 | 0.01 | 3 min 21 s | 30 GB | Linearly with number of reads |
| | SOAPdenovo2 | r223 | | +45% | 0.0015 | 0.0016 | 0.01 | 4 min 52 s | 30 GB | |
| | SGA | 0.9.43 | | +2,752% | 0.029 | 0.032 | 0.17 | 1 h 35 min | 30 GB | |
| Genome scaffolding | SSPACE | 2.0 | Scaffolding 23 million short reads from human chromosome 14 (Hunt et al. 2014) | | 0.0027 | 0.0029 | 0.02 | 8 min 40 s | 30 GB | |
| | SOAPdenovo2 | r223 | | +34% | 0.0036 | 0.0039 | 0.02 | 1 min 38 s | 30 GB | |
| | SGA | 0.9.43 | | +4,801% | 0.13 | 0.14 | 0.74 | 7 h 05 min | 30 GB | |
| Genome assembly | ABySS | 2.0 | De novo assembly of a human genome from Illumina sequencing reads (Jackman et al. 2017) | | 11 | 12 | 61 | 20 h | 34 GB | |
| | MEGAHIT | 1.0.6 | | +42% | 15 | 16 | 86 | 26 h | 197 GB | |
| Metagenome assembly | MetaVelvet k101 | 1.2.01 | Metagenome assembly from 100 soil samples (Vollmers et al. 2017) | | 14 | 16 | 82 | 1 h 06 min | 130 GB | |
| | MEGAHIT | 1.0.3 | | +438% | 77 | 84 | 439 | 15 h 36 min | 12 GB | |
| | metaSPAdes | 3.8.0 | | +1,206% | 186 | 203 | 1,065 | 29 h 24 min | 60 GB | |
| Metagenome classification (short read) | Kraken2 | 2.0.7 | Metagenomic classification of 5 Gb of randomly sampled reads from the Zymo mock community (batch ZRC190633), containing yeast and Gram-negative and Gram-positive bacteria (Dilthey et al. 2019) | | 0.0052 | 0.0057 | 0.03 | 20 min | 21 GB | Linearly with number of reads |
| | Centrifuge | 1.0.4 | | +141% | 0.013 | 0.014 | 0.07 | 58 min | 12 GB | |
| | Kraken/Bracken | 0.10.5/1.0.0 | | +1,650% | 0.092 | 0.10 | 0.52 | 1 h 40 min | 154 GB | |
| Metagenome classification (long read) | MetaMaps | | | | 18.25 | 19.91 | 104.27 | 209 h 53 min | 262 GB | |
| Phylogenetics | BEAST/BEAGLE | 1.8.4/2.1.2 | Codon substitution modeling of extant carnivores and a pangolin group; nucleotide substitution and phylogeographic modeling of Ebola virus genomes; see supplementary table 2, Supplementary Material online, for detailed results (Baele et al. 2019) | | 0.012–0.30 | 0.013–0.33 | 0.069–1.72 | 3 min 30 s to 7 h 45 min | 2–8 GB | Power law with number of loci |
| Phylogenetics | RAxML/ExaML, PhyML, IQ-TREE, FastTree | 8.2.0/3.0.17, 20160530, 1.4.2, 2.1.9 | Over 670,000 tree inferences on about 45,000 single-gene alignments and supermatrices from 19 empirical phylogenomic data sets with thousands of genes and around 200 taxa (Zhou et al. 2018) | | 3565 | 3889 | 20,371 | 300,000 h | 8 GB | |
| Phylogenetics | ExaML | | A 322-million-bp MULTIZ alignment of putatively orthologous genome regions across all species, comprising approximately 30% of an average assembled avian genome; this corresponded to the maximal orthologous sequence obtainable across all orders of Neoaves (Jarvis et al. 2014) | | 4372 | 4769 | 24,983 | 367,920 h | 8 GB | |
| RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to the Homo sapiens hg19 genome (Baruzzo et al. 2017) | | 0.0054 | 0.0059 | 0.031 | 1 min 48 s | 5 GB | Linearly with number of reads |
| | STAR | 2.5.0a | | +78% | 0.0097 | 0.011 | 0.055 | 6 min 01 s | 35 GB | |
| | TopHat2 | 2.1.0 | | +5,756% | 0.32 | 0.35 | 1.81 | 2 h 14 min | 16 GB | |
| | Novoalign | 3.02.13 | | +17,926% | 0.98 | 1.07 | 5.58 | 32 h 12 min | 64 GB | |
| RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to the Plasmodium falciparum genome (Baruzzo et al. 2017) | | 0.0052 | 0.0057 | 0.030 | 1 min 44 s | 1 GB | |
| | TopHat2 | 2.1.0 | | +4,519% | 0.24 | 0.26 | 1.37 | 1 h 25 min | 13 GB | |
| | STAR | 2.5.0a | | +7,025% | 0.37 | 0.40 | 2.11 | 2 h 27 min | 8 GB | |
| | Novoalign | 3.02.13 | | +12,847% | 0.67 | 0.73 | 3.83 | 38 h 04 min | 21 GB | |
| RNA-seq QC pipeline | FastQC, TrimGalore, bbmap/clumpify, and STAR | -/v0.6.0/-/v2.7.0e | Quality control analysis of raw read quality of 392 samples from the Childhood Asthma Study (in-house) | | 54.97 | 59.97 | 314.11 | 485 h 12 min | 8 GB | |
| Transcript isoform abundance estimation | Sailfish (1 core) | 0.6.3 | Transcript isoform quantification of 100 million in silico reads generated from Flux Simulator with the hg19 genome and GENCODE v19 annotation set (Kanitz et al. 2015) | | 0.0081 | 0.0088 | 0.046 | 42 min | 7 GB | Linearly with the number of reads |
| | Sailfish (16 cores) | | +344% | | 0.036 | 0.039 | 0.21 | 14 min | 7 GB | |
| | Cufflinks (1 core) | 2.1.1 | | +451% | 0.045 | 0.049 | 0.26 | 3 h 30 min | 11 GB | |
| | Cufflinks (16 cores) | | +3,262% | | 0.27 | 0.30 | 1.56 | 1 h 45 min | 12 GB | |
| | RSEM (1 core) | 1.2.18 | | +6,982% | 0.57 | 0.63 | 3.28 | 47 h 10 min | 9 GB | |
| | RSEM (16 cores) | | +17,162% | | 1.40 | 1.53 | 8.00 | 8 h 50 min | 21 GB | |
| GWAS | BOLT-LMM | 2.3 | Analyses of a single trait in UK Biobank (N = 500,000) (Loh et al. 2018) | | 4.70 | 5.13 | 26.87 | 60 h 58 min | 100 GB | Linearly with number of variants |
| | BOLT-LMM | 1.0 | | +268% | 17.29 | 18.86 | 98.81 | 224 h 10 min | 100 GB | |
| Cohort-scale eQTL analysis | TensorQTL | 1.0.2 | Cis-eQTL mapping of 10.7 M SNPs against 18,373 genetic features in a cohort of 2,745 individuals (in-house) | | 2.04 | 2.22 | 11.7 | 1 h 14 min | 192 GB | Nonlinearly with the number of traits or the sample size |
| | LIMIX | 2.0.3 | | +9,256% | 190.73 | 208.07 | 1,089.9 | 9,705 h | 41–221 GB | |
| Single cis-eQTL gene mapping | TensorQTL | | Cis-eQTL mapping of one gene from skeletal muscle in GTEx (v6p) (Taylor-Weiner et al. 2019) | | 0.00001 | 0.00001 | 0.00004 | 0.11 s | 52 GB | |
| | FastQTL | | | +2,681% | 0.0002 | 0.0002 | 0.001 | 30 s | 52 GB | |
| Molecular dynamics simulation | AMBER | 18 | Simulation of a Satellite Tobacco Mosaic Virus with 1,066,628 atoms for 100 ns (a) (NAMD Performance n.d.; The Pmemd.Cuda GPU Implementation n.d.) | | 18 | 19 | 102 | 75 h | (b) | |
| | NAMD | 2.13 | | +433% | 95 | 104 | 544 | 400 h | (b) | |
| Molecular docking | Glide | 57111 | Molecular docking of four DUD systems, scaled to 1 M ligands (Ruiz-Carmona et al. 2014) | | 13 | 14 | 74 | 1,027 h 47 min | 0.05 GB | |
| | rDock | | | +1,092% | 154 | 168 | 878 | 12,250 h | 0.05 GB | |
| | AutoDock Vina | | | +3,886% | 514 | 561 | 2,938 | 40,972 h | 0.05 GB | |

In 2021, Loïc Lannelongue et al., in “Green Algorithms: Quantifying the Carbon Footprint of Computation,” proposed a carbon footprint calculator: https://calculator.green-algorithms.org/.

The processors in our cluster are not available in the calculator’s list, so if you want to use it for your jobs on our cluster, select “other” and set the TDP (Thermal Design Power) per core to 3.5 W. In addition, the PUE (Power Usage Effectiveness) of the datacentre hosting our machines is 1.4.
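As a rough command-line alternative to the web calculator, you can apply the Green Algorithms model directly: energy = runtime × (cores × TDP per core + memory × memory power) × PUE, and carbon = energy × carbon intensity. The sketch below uses our cluster’s values (3.5 W per core, PUE 1.4); the memory power draw (0.3725 W/GB) and the carbon intensity (475 gCO2e/kWh, a world average) are assumed defaults from the Green Algorithms paper, so replace them with values for your own region:

```shell
# Rough estimate of a job's energy use and carbon footprint
# following the Green Algorithms model.
# Assumed constants: 0.3725 W/GB memory power and 475 gCO2e/kWh
# (world-average carbon intensity) -- adjust to your situation.
runtime_h=10      # wall-clock time in hours
cores=4           # CPU cores used
mem_gb=16         # memory requested, in GB
tdp_per_core=3.5  # W per core (value for our cluster)
pue=1.4           # datacentre Power Usage Effectiveness

energy_kwh=$(awk -v t="$runtime_h" -v c="$cores" -v m="$mem_gb" \
                 -v w="$tdp_per_core" -v p="$pue" \
  'BEGIN { printf "%.3f", t * (c * w + m * 0.3725) * p / 1000 }')
gco2e=$(awk -v e="$energy_kwh" 'BEGIN { printf "%.1f", e * 475 }')
echo "Estimated energy: ${energy_kwh} kWh, footprint: ${gco2e} gCO2e"
```

For the example values above (10 h on 4 cores with 16 GB), this prints an estimate of about 0.279 kWh and 132.5 gCO2e; comparing such estimates before and after tuning your resource requests makes the savings concrete.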