If you want to publish results on short reads you have to publish your data.
- For genomic data you can submit it in the Short Read Archive (SRA).
- For RNAseq data you can submit it in MAGE-TAG: http://www.ebi.ac.uk/cgi-bin/microarray/magetab.cgi.
with a SRA run id (eg SRR1045854) type :
-
- create ncbi directory:
mkdir ~/work/ncbi
- create link :
ln -s ~/work/ncbi ~/ncbi
- Load module:
module load bioinfo/sratoolkit.2.8.2-1
- download data :
prefetch SRR1045854
Your data will be available ~/work/ncbi/public/sra/SRR1045854.sra - convert sra to fastq
fastq-dump ~/wor/ncbi/public/sra/SRR1045854.sra
- create ncbi directory:
After conversion, remove .sra file to save disk space.
You should use file from ENA as the protocol is much much much faster and files are already in fastq format.
- Find your dataset at https://www.ebi.ac.uk/ena/.
Here is an example - Find the FASTQ url in columns "FASTQ files (FTP)", in the example it's :
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz
- On Genotoul cluster type :
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz .
- Or with aspera
/tools/aspera/bin/ascp -QT -P33001 -i /tools/aspera/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz .
For paired file please use perl syntax.
1) search your module
search_module fastqc
2) build your command which need to be in one line, so separate commands with ";"
module load bioinfo/FastQC_v0.11.7; fastqc MT_rep1_1_Ch6.fastq.gz
3) test it on one sample and stop execution (ctrl + c) if syntax is correct
4) build the bash loop:
for i in `ls *.fastq.gz`
do
echo "module load bioinfo/FastQC_v0.11.7; fastqc $i"
done
The value of the variable $i will be consequently equal to the filenames resulted from 'ls *.fastq.gz'
WARNING: le command ls *.fastq.gz
MUST be between character `
which is the quote of keyboard number 7
5) generate the this in oneline
TIPS: go "up" in your history to get the following line, then add the redirection to a file
for i in `ls *.fastq.gz`; do echo "module load bioinfo/FastQC_v0.11.7; fastqc $i"; done > mescommandesfastqc.sh
6) check file content
more mescommandesfastqc.sh
7) launch parallel command
sarray -J fastqcjob -o %j.out -e %j.err -t 01:00:00 --mem=2G --mail-type=BEGIN,END,FAIL mescommandesfastqc.sh
%j
correspond to the job number during execution.
adapt parameters to your needs.
9) check execution
squeue -u username
Here is an example for a STAR command
1) search your module
search_module STAR
2) build your command which need to be in one line, so separate commands with ";"
module load XXX; STAR --genomeDir star-index --readFilesIn samplename.R1.fastq.gz samplename.R2.fastq.gz --outFileNamePrefix samplename [options]...
3) test it on one sample and stop execution (ctrl + c) if syntax is correct
4) find your pattern to list all your fastq.gz files:
ls /directory/*.fastq.gz | paste - -
The paste
command with two dash, take line two by two. So it's generate one line per pair files.
the result of this command, for n samples would be
/directory/sample1.R1.fastq.gz /directory/sample1.R2.fastq.gz
/directory/sample2.R1.fastq.gz /directory/sample2.R2.fastq.gz
/directory/sample3.R1.fastq.gz /directory/sample3.R2.fastq.gz
...
5) use oneline perl script to print a string (command) per pair files.
ls /directory/*.fastq.gz | paste - - | perl -lane '($ech)=$F[0]=~/(.*).fastq.gz/; print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $ech ...."' > mesSTARcommandes.sh
Explanations:
perl -lane
: for each line, remove 'enter' at end of line, split line by spaces and initialize an array:$F[]
so in our example (outputs from 4)$F[0]
is the first column corresponding to R1 filename$F[1]
is the second column corresponding to R2 filename($ech)=$F[0]=~/(.*).fastq.gz/
from R1 file name ($F[0]
) retrieve the sample name (value in parenthesis in regexp) into a variable '$ech'- print the command line with variables:
$F[O] (file R1), $F[1] (file R2), $ech (sample name)
print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $ech ...."
- once the lines generated correspond to what you want, redirect output into a file (
>
)
> mescommandes.sh
6) check the content of your file
more mescommands.sh
and test a line of your file to check the command line syntax
7) launch your commands in parallel on the cluster:
sarray -J fastqcjob -o %j.out -e %j.err -t 01:00:00 --mem=2G --mail-type=BEGIN,END,FAIL mescommandes.sh
%j
correspond to the job number during execution.
adapt parameters to your needs.
8) check execution
squeue -u username