Being mindful of the environmental impact of our activities is crucial to a sustainable future. The field of big data keeps expanding, and with it come higher energy consumption, growing data storage needs, and a substantial carbon footprint.
Information technology and computing clusters have a substantial environmental impact, primarily through their intensive consumption of primary resources. These systems rely heavily on rare and non-renewable resources such as rare earth metals, which are essential for manufacturing semiconductors and high-performance computing components. The extraction and processing of these resources can lead to habitat destruction, pollution, and ecosystem disruption. Additionally, the energy-intensive nature of data centers and clusters, required for the processing and storage of data, places significant demands on water resources for cooling purposes.
Here are some good practices to save time and reduce your carbon footprint.
First, considering how urgent the situation is, making research stakeholders aware of these issues is key to achieving the sustainability goals that humanity needs to meet in order to survive.
- For extensive analyses, consider whether launching the analysis is truly necessary.
- If you have a substantial number of similar processes (e.g., numerous fastq files to analyze):
  - launch the process on a single file to assess your memory, CPU, and time requirements;
  - use the `seff` command to gather information about the completed job;
  - then launch the job array with the adjusted parameters.
- If you are using a workflow manager:
  - review the default resources in the configuration and reduce them if necessary. You can also use the approach above to fine-tune the values;
  - pause processing at quality-check steps, and run the next step only if the quality is sufficient.
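The test-then-scale approach above can be sketched as follows. This is a minimal illustration, not a cluster-specific recipe: it assumes a Slurm scheduler (which `seff` implies), the 20% safety margin is an arbitrary choice, and the script name `analyze_one.sh` and the measured values are hypothetical. The peak memory and wall-clock time are the figures you would read from the `seff` report of the single test job.

```python
import math


def padded_request(peak_mem_gb, wall_minutes, margin=0.2):
    """Round measured usage up, adding a safety margin (assumed: 20%)."""
    mem_gb = math.ceil(peak_mem_gb * (1 + margin))
    minutes = math.ceil(wall_minutes * (1 + margin))
    return mem_gb, minutes


def sbatch_array_command(n_files, cpus, mem_gb, minutes, script="analyze_one.sh"):
    """Build an sbatch job-array command line (script name is hypothetical)."""
    hours, mins = divmod(minutes, 60)
    return (
        f"sbatch --array=0-{n_files - 1} --cpus-per-task={cpus} "
        f"--mem={mem_gb}G --time={hours:02d}:{mins:02d}:00 {script}"
    )


# Hypothetical measurements taken from the test job on a single file.
mem, minutes = padded_request(peak_mem_gb=2.6, wall_minutes=42)
command = sbatch_array_command(n_files=200, cpus=2, mem_gb=mem, minutes=minutes)
print(command)
```

Requesting only slightly more than the measured usage keeps each array task from reserving memory and cores it will never use.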
——
The article “The Carbon Footprint of Bioinformatics,” published in Molecular Biology and Evolution, Volume 39, Issue 3, March 2022, investigates the carbon footprint associated with bioinformatics, a field that uses computational tools to analyze and interpret biological data. The authors explore various stages of the bioinformatics process, including data collection, analysis, storage, and sharing.
The study assesses how these stages contribute to greenhouse gas emissions and examines the factors that influence the carbon footprint of bioinformatics. The authors also discuss the implications of their findings, emphasizing the importance of considering environmental impact in bioinformatics research.
In summary, the article highlights the need to be aware of the carbon footprint associated with bioinformatics and suggests avenues for reducing its environmental impact. Its Table 1, “Carbon Footprint of a Range of Bioinformatic Tasks,” is reproduced here:
Task | Tool | Version | Details about the Experiments | Increase (%) | Carbon Footprint (kgCO2e) | Tree-months | km in a Car (EU) | Running Time and Memory | Approximate Scaling (if known) |
---|---|---|---|---|---|---|---|---|---|
Genome scaffolding | SSPACE | 2.0 | Scaffolding 2.4 million long reads from human chromosome 14 (Hunt et al. 2014). | — | 0.0010 | 0.0011 | 0.01 | 3 min 21 s, 30 GB | Linearly with number of reads. |
| SOAPdenovo2 | r223 | | +45% | 0.0015 | 0.0016 | 0.01 | 4 min 52 s, 30 GB | |
| SGA | 0.9.43 | | +2,752% | 0.029 | 0.032 | 0.17 | 1 h 35 min, 30 GB | |
Genome scaffolding | SSPACE | 2.0 | Scaffolding 23 million short reads from human chromosome 14 (Hunt et al. 2014). | — | 0.0027 | 0.0029 | 0.02 | 8 min 40 s, 30 GB | |
| SOAPdenovo2 | r223 | | +34% | 0.0036 | 0.0039 | 0.02 | 1 min 38 s, 30 GB | |
| SGA | 0.9.43 | | +4,801% | 0.13 | 0.14 | 0.74 | 7 h 05 min, 30 GB | |
Genome assembly | Abyss | 2.0 | De novo assembly of a human genome from Illumina sequencing reads (Jackman et al. 2017). | — | 11 | 12 | 61 | 20 h, 34 GB | |
| MEGAHIT | 1.0.6 | | +42% | 15 | 16 | 86 | 26 h, 197 GB | |
Metagenome assembly | MetaVelvet k101 | 1.2.01 | Metagenome assembly from 100 soil samples (Vollmers et al. 2017). | — | 14 | 16 | 82 | 1 h 06 min, 130 GB | |
| MEGAHIT | 1.0.3 | | +438% | 77 | 84 | 439 | 15 h 36 min, 12 GB | |
| metaSPAdes | 3.8.0 | | +1,206% | 186 | 203 | 1,065 | 29 h 24 min, 60 GB | |
Metagenome classification (short read) | Kraken2 | 2.0.7 | Metagenomic classification of 5 Gb of randomly sampled reads from Zymo mock community (batch ZRC190633), containing yeast, Gram-negative, and positive bacteria (Dilthey et al. 2019). | — | 0.0052 | 0.0057 | 0.03 | 20 min, 21 GB | Linearly with number of reads. |
| Centrifuge | 1.0.4 | | +141% | 0.013 | 0.014 | 0.07 | 58 min, 12 GB | |
| Kraken/Bracken | 0.10.5/1.0.0 | | +1,650% | 0.092 | 0.10 | 0.52 | 1 h 40 min, 154 GB | |
Metagenome classification (long read) | MetaMaps | — | | — | 18.25 | 19.91 | 104.27 | 209 h 53 min, 262 GB | |
Phylogenetics | BEAST/BEAGLE | 1.8.4/2.1.2 | Codon substitution modeling of extant carnivores and a pangolin group. Nucleotide substitution and phylogeographic modeling of Ebola virus genomes. See supplementary table 2, Supplementary Material online, for detailed results (Baele et al. 2019). | — | 0.012–0.30 | 0.013–0.33 | 0.069–1.72 | 3 min 30 s to 7 h 45 min, 2–8 GB | Power law with number of loci. |
Phylogenetics | RAxml/ExaML, PhyML, IQ-TREE, FastTree | 8.2.0/3.0.17, 20160530, 1.4.2, 2.1.9 | Over 670,000 tree inferences on about 45,000 single-gene alignments and supermatrices from 19 empirical phylogenomic data sets with thousands of genes and around 200 taxa (Zhou et al. 2018). | — | 3565 | 3889 | 20,371 | 300,000 h, 8 GB | |
Phylogenetics | ExaML | — | A 322-million-bp MULTIZ alignment of putatively orthologous genome regions across all species, comprising approximately 30% of an average assembled avian genome. This corresponded to the maximal orthologous sequence obtainable across all orders of Neoaves (Jarvis et al. 2014). | — | 4372 | 4769 | 24,983 | 367,920 h, 8 GB | |
RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to Homo Sapiens hg19 genome (Baruzzo et al. 2017). | — | 0.0054 | 0.0059 | 0.031 | 1 min 48 s, 5 GB | Linearly with number of reads. |
| STAR | 2.5.0a | | +78% | 0.0097 | 0.011 | 0.055 | 6 min 01 s, 35 GB | |
| TopHat2 | 2.1.0 | | +5,756% | 0.32 | 0.35 | 1.81 | 2 h 14 min, 16 GB | |
| Novoalign | 3.02.13 | | +17,926% | 0.98 | 1.07 | 5.58 | 32 h 12 min, 64 GB | |
RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to Plasmodium falciparum genome (Baruzzo et al. 2017). | — | 0.0052 | 0.0057 | 0.030 | 1 min 44 s, 1 GB | |
| TopHat2 | 2.1.0 | | +4,519% | 0.24 | 0.26 | 1.37 | 1 h 25 min, 13 GB | |
| STAR | 2.5.0a | | +7,025% | 0.37 | 0.40 | 2.11 | 2 h 27 min, 8 GB | |
| Novoalign | 3.02.13 | | +12,847% | 0.67 | 0.73 | 3.83 | 38 h 04 min, 21 GB | |
RNA-seq QC pipeline | FastQC, TrimGalore, bbmap/clumpify, and STAR | -/v0.6.0/-/v2.7.0e | Quality control analysis of raw reads quality of 392 samples from the Childhood Asthma Study (in-house). | — | 54.97 | 59.97 | 314.11 | 485 h 12 min, 8 GB | |
Transcript isoform abundance estimation | Sailfish 1 core | 0.6.3 | Transcript isoform quantification of 100 million in silico reads generated from Flux Simulator with hg19 genome and GENCODE v19 annotation set (Kanitz et al. 2015). | — | 0.0081 | 0.0088 | 0.046 | 42 min, 7 GB | Linearly with the number of reads. |
| Sailfish 16 cores | | | +344% | 0.036 | 0.039 | 0.21 | 14 min, 7 GB | |
| Cufflinks 1 core | 2.1.1 | | +451% | 0.045 | 0.049 | 0.26 | 3 h 30 min, 11 GB | |
| Cufflinks 16 cores | | | +3,262% | 0.27 | 0.30 | 1.56 | 1 h 45 min, 12 GB | |
| RSEM 1 core | 1.2.18 | | +6,982% | 0.57 | 0.63 | 3.28 | 47 h 10 min, 9 GB | |
| RSEM 16 cores | | | +17,162% | 1.40 | 1.53 | 8.00 | 8 h 50 min, 21 GB | |
GWAS | Bolt-LMM | 2.3 | Analyses of a single trait in UK Biobank (N = 500,000) (Loh et al. 2018). | — | 4.70 | 5.13 | 26.87 | 60 h 58 min, 100 GB | Linearly with number of variants. |
| Bolt-LMM | 1.0 | | +268% | 17.29 | 18.86 | 98.81 | 224 h 10 min, 100 GB | |
Cohort scale eQTL analysis | TensorQTL | 1.0.2 | Cis-eQTL mapping of 10.7 M SNPs against 18,373 genetic features in a cohort of 2,745 individuals (in-house). | — | 2.04 | 2.22 | 11.7 | 1 h 14 min, 192 GB | Nonlinearly with the number of traits or the sample size. |
| LIMIX | 2.0.3 | | +9,256% | 190.73 | 208.07 | 1,089.9 | 9,705 h, 41–221 GB | |
Single cis-eQTL gene mapping | TensorQTL | — | Cis-eQTL mapping one gene from skeletal muscle in GTEx (v6p) (Taylor-Weiner et al. 2019). | — | 0.00001 | 0.00001 | 0.00004 | 0.11 s, 52 GB | |
| FastQTL | — | | +2,681% | 0.0002 | 0.0002 | 0.001 | 30 s, 52 GB | |
Molecular dynamics simulation | AMBER | 18 | Simulation of a Satellite Tobacco Mosaic Virus with 1,066,628 atoms for 100 ns (a) (NAMD Performance n.d.; The Pmemd.Cuda GPU Implementation n.d.). | — | 18 | 19 | 102 | 75 h, (b) | |
| NAMD | 2.13 | | +433% | 95 | 104 | 544 | 400 h, (b) | |
Molecular docking | Glide | 57111 | Molecular docking of four DUD systems, scaled to 1 m ligands (Ruiz-Carmona et al. 2014). | — | 13 | 14 | 74 | 1,027 h 47 min, 0.05 GB | |
| rDock | — | | +1,092% | 154 | 168 | 878 | 12,250 h, 0.05 GB | |
| AutoDock Vina | — | | +3,886% | 514 | 561 | 2,938 | 40,972 h, 0.05 GB | |
Loïc Lannelongue et al. (2021), in “Green Algorithms: Quantifying the Carbon Footprint of Computation,” propose a carbon footprint calculator: https://calculator.green-algorithms.org/.
The processors in our cluster are not available in the calculator's list, so to estimate the footprint of jobs run on our cluster, select “other” and set the TDP (Thermal Design Power) per core to 3.5 W. In addition, the PUE (Power Usage Effectiveness) of the datacentre hosting our machines is 1.4.
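For a rough offline estimate, the sketch below applies the energy model described in the Green Algorithms paper: power drawn by cores and memory, multiplied by runtime and PUE, then by the carbon intensity of the electricity. The TDP per core (3.5 W) and PUE (1.4) are the cluster values given above. The memory power of 0.3725 W per GB is the figure used by Green Algorithms, and the carbon intensity of 100 gCO2e/kWh is a placeholder assumption that must be replaced with the value for your location. The function name and the example job are hypothetical, and the online calculator remains the reference.

```python
def estimate_footprint(
    runtime_h,
    n_cores,
    mem_gb,
    carbon_intensity_g_per_kwh,
    tdp_per_core_w=3.5,          # cluster value stated in the text
    pue=1.4,                     # datacentre PUE stated in the text
    core_usage=1.0,              # assumed: cores fully busy
    mem_power_w_per_gb=0.3725,   # value used by Green Algorithms
):
    """Return (energy in kWh, emissions in gCO2e) for one job."""
    power_w = n_cores * tdp_per_core_w * core_usage + mem_gb * mem_power_w_per_gb
    energy_kwh = runtime_h * power_w * pue / 1000.0
    return energy_kwh, energy_kwh * carbon_intensity_g_per_kwh


# Hypothetical job: 10 h on 8 cores with 32 GB of requested memory.
energy_kwh, carbon_g = estimate_footprint(
    runtime_h=10, n_cores=8, mem_gb=32, carbon_intensity_g_per_kwh=100
)
print(f"{energy_kwh:.3f} kWh, {carbon_g:.1f} gCO2e")
```

Note that the memory term depends on the memory requested rather than the memory actually used, which is one more reason to size requests carefully.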