Bioinfo tips

If you want to publish results based on short reads, you have to publish your data.

  • For genomic data you can submit it to the Sequence Read Archive (SRA).
The European mirror of SRA is the ENA; have a look at the video in the paragraph "Submitting public access data using SRA Webin". Follow these steps:
    - create your account
    - fill in all the needed information
    - compute the md5 of your raw files, or ask the bioinformatics platform if your data are in NG6. An md5 is a unique string generated from the content of a file. To compute it on Windows, use WinMD5; on Linux, use the command md5sum /path/to/file in a terminal (see the example just after this list).
    - transfer your data when it is required, or ask the bioinformatics platform if your data are in NG6.
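For example, on Linux, to compute the checksums of all your FASTQ files at once and keep them in a text file (the file names and the output file name are only illustrative):
    md5sum *.fastq.gz > md5sums.txt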

To download data from SRA with a run ID (e.g. SRR1045854), type:

    • create the ncbi directory:
      mkdir ~/work/ncbi
    • create a symbolic link:
      ln -s ~/work/ncbi ~/ncbi
    • load the module:
      module load bioinfo/sratoolkit.2.8.2-1
    • download the data:
      prefetch SRR1045854
      Your data will be available in ~/work/ncbi/public/sra/SRR1045854.sra
    • convert the sra file to fastq:
      fastq-dump ~/work/ncbi/public/sra/SRR1045854.sra

After conversion, remove the .sra file to save disk space.
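Putting the conversion and the cleanup together, a minimal sketch (same run ID and paths as above; the --gzip option of fastq-dump is optional, and --split-files is useful for paired-end runs):
      fastq-dump --gzip ~/work/ncbi/public/sra/SRR1045854.sra    # writes SRR1045854.fastq.gz in the current directory
      rm ~/work/ncbi/public/sra/SRR1045854.sra                   # free disk space once the FASTQ file is written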
You should preferably use the files from ENA, as the download protocol is much faster and the files are already in FASTQ format.

  • Find your dataset at https://www.ebi.ac.uk/ena/.
    Here is an example
  • Find the FASTQ URL in the column "FASTQ files (FTP)"; in the example it is: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz
  • On the Genotoul cluster, type:
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz
  • Or with Aspera:
    /tools/aspera/bin/ascp -QT -P33001 -i /tools/aspera/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz .
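For a paired-end run, ENA lists two FASTQ URLs (ending in _1.fastq.gz and _2.fastq.gz) and you need to download both. A sketch with a made-up accession (copy the real URLs from your own ENA report):
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz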

For paired files, please use the Perl syntax (see the STAR example below).

Here is an example for a FastQC command

1) search for your module
search_module fastqc

2) build your command, which needs to be on one line, so separate commands with ";"
module load bioinfo/FastQC_v0.11.7; fastqc MT_rep1_1_Ch6.fastq.gz

3) test it on one sample; if the syntax is correct, stop the execution (Ctrl+C)

4) build the bash loop:
for i in `ls *.fastq.gz`
do
echo "module load bioinfo/FastQC_v0.11.7; fastqc $i"
done

The variable $i will successively take as its value each filename returned by 'ls *.fastq.gz'.
WARNING: the command ls *.fastq.gz MUST be enclosed between backtick characters (`), not regular quotes (on a French AZERTY keyboard: AltGr + 7).
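With, for example, three FASTQ files in the directory (the filenames are only illustrative), the loop prints one command per file:
module load bioinfo/FastQC_v0.11.7; fastqc sample1.fastq.gz
module load bioinfo/FastQC_v0.11.7; fastqc sample2.fastq.gz
module load bioinfo/FastQC_v0.11.7; fastqc sample3.fastq.gz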

5) generate the same loop on a single line and redirect the output to a file
TIP: press the up arrow to get the previous loop back from your history, then add the redirection to a file
for i in `ls *.fastq.gz`; do echo "module load bioinfo/FastQC_v0.11.7; fastqc $i"; done > mescommandesfastqc.sh

6) check file content
more mescommandesfastqc.sh

7) launch parallel command
sarray -J fastqcjob -o %j.out -e %j.err -t 01:00:00 --mem=2G --mail-type=BEGIN,END,FAIL mescommandesfastqc.sh

%j corresponds to the job number during execution.
Adapt the parameters to your needs.
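For example (the values below are arbitrary), to give each task 4 GB of memory and 4 hours of walltime:
sarray -J fastqcjob -o %j.out -e %j.err -t 04:00:00 --mem=4G --mail-type=BEGIN,END,FAIL mescommandesfastqc.sh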

8) check execution
squeue -u username

Here is an example for a STAR command

1) search your module
search_module STAR

2) build your command, which needs to be on one line, so separate commands with ";"
module load XXX; STAR --genomeDir star-index --readFilesIn samplename.R1.fastq.gz samplename.R2.fastq.gz --outFileNamePrefix samplename [options]...
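As a more concrete sketch (the module name remains a placeholder and the extra STAR options are only an assumption of what you may need; note that gzipped input requires --readFilesCommand zcat):
module load XXX; STAR --runThreadN 4 --genomeDir star-index --readFilesCommand zcat --readFilesIn samplename.R1.fastq.gz samplename.R2.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix samplename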

3) test it on one sample; if the syntax is correct, stop the execution (Ctrl+C)

4) find your pattern to list all your fastq.gz files:
ls /directory/*.fastq.gz | paste - -
The paste command with two dashes takes the lines two by two, so it generates one line per pair of files.

The result of this command, for n samples, would be:
/directory/sample1.R1.fastq.gz /directory/sample1.R2.fastq.gz
/directory/sample2.R1.fastq.gz /directory/sample2.R2.fastq.gz
/directory/sample3.R1.fastq.gz /directory/sample3.R2.fastq.gz
...

5) use a one-line Perl script to print one command line per pair of files.

ls /directory/*.fastq.gz | paste - - | perl -lane '($ech)=$F[0]=~/(.*).fastq.gz/; print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $ech ...."' > mesSTARcommandes.sh

Explanations:

  • perl -lane : for each line, remove the newline at the end of the line, split the line on whitespace and store the fields in the array @F
    so in our example (the output from step 4):
  • $F[0] is the first column, corresponding to the R1 filename
  • $F[1] is the second column, corresponding to the R2 filename
  • ($ech)=$F[0]=~/(.*).fastq.gz/
    from the R1 filename ($F[0]), the part captured by the parentheses in the regexp (everything before .fastq.gz) is stored in the variable $ech and used as the output prefix
  • print the command line with the variables $F[0] (R1 file), $F[1] (R2 file) and $ech (output prefix):
    print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $ech ...."
  • once the generated lines correspond to what you want, redirect the output into a file (>)
    > mesSTARcommandes.sh
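For the example files listed in step 4, the generated file would contain lines like the following (note that $ech keeps the directory and the R1 tag, e.g. /directory/sample1.R1, which is then used as the output prefix):
module load XXX; STAR --genomeDir star-index --readFilesIn /directory/sample1.R1.fastq.gz /directory/sample1.R2.fastq.gz --outFileNamePrefix /directory/sample1.R1 ....
module load XXX; STAR --genomeDir star-index --readFilesIn /directory/sample2.R1.fastq.gz /directory/sample2.R2.fastq.gz --outFileNamePrefix /directory/sample2.R1 ....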

6) check the content of your file
more mesSTARcommandes.sh
and test one line of your file to check the command line syntax
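For example, to display the first command so you can copy/paste and run it on a single sample:
head -n 1 mesSTARcommandes.sh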

7) launch your commands in parallel on the cluster:

sarray -J starjob -o %j.out -e %j.err -t 01:00:00 --mem=2G --mail-type=BEGIN,END,FAIL mesSTARcommandes.sh

%j corresponds to the job number during execution.
Adapt the parameters to your needs (STAR typically needs more memory than in this example, depending on the size of the genome index).

8) check execution
squeue -u username