Deal with FASTQ Files
How to launch the same software on all my FASTQ files?
We will launch fastqc with 2 CPUs on each of the 30 FASTQ files, one job array task per file:
#!/bin/bash
#
#SBATCH --array 0-29 # 30 jobs
#SBATCH --cpus-per-task 2
module load fastqc/0.11.9
INPUTS=(../fastq/*fastq.gz)
fastqc -f fastq -t $SLURM_CPUS_PER_TASK -o . ${INPUTS[$SLURM_ARRAY_TASK_ID]}
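The array indexing used above can be tried outside SLURM. The following is only a minimal sketch: the temporary directory, the three empty files, and the fallback task ID of 1 are placeholders invented for the demonstration, and no fastqc call is made. It shows that the glob fills a bash array in sorted order, that the array is indexed from 0 (hence --array 0-29 for 30 files), and how a task could guard against an index beyond the last file.

```shell
#!/bin/bash
# Sketch: how a glob-filled array maps a task ID to one file.
# The temporary directory and file names are placeholders for illustration.
workdir=$(mktemp -d)
touch "$workdir"/a.fastq.gz "$workdir"/b.fastq.gz "$workdir"/c.fastq.gz

# Outside SLURM the variable is unset, so fall back to a hypothetical ID of 1.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

# The glob expands in sorted order; bash arrays are indexed from 0.
INPUTS=("$workdir"/*fastq.gz)
echo "Number of files: ${#INPUTS[@]}"

# Guard against an --array range larger than the number of files.
if [ "$TASK_ID" -ge "${#INPUTS[@]}" ]; then
    echo "No input for task $TASK_ID" >&2
    exit 1
fi

SELECTED=${INPUTS[$TASK_ID]}
echo "Task $TASK_ID would process: $(basename "$SELECTED")"
rm -r "$workdir"
```

Because the index starts at 0, the last valid task ID is the number of files minus one.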
How to launch the same software on my paired-end FASTQ files?
Strategy #1: with a sample metadata file
It is useful to keep a sample metadata file that describes your samples, files, conditions, and so on, especially for the downstream statistical analysis.
samplemetadata.tsv
wt1 | wt1_R1.fastq.gz | wt1_R2.fastq.gz | wt |
wt2 | wt2_R1.fastq.gz | wt2_R2.fastq.gz | wt |
treat1 | treat1_R1.fastq.gz | female_wt1_R2.fastq.gz | treat |
treat2 | treat2_R1.fastq.gz | female_wt2_R2.fastq.gz | treat |
#!/bin/bash
#
#SBATCH --mem 70GB
#SBATCH --cpus-per-task 30
#SBATCH --array=1-16
module load star/2.7.5a
module load samtools/1.10
# INPUTS PATH
DIRfastq="../fastq/"
GENOME="/shared/bank/homo_sapiens/GRCh38/"
STARINDEX="$GENOME/star-2.7.5a/"
GTF="$GENOME/gtf/Homo_sapiens.GRCh38.101.gtf"
SAMPLEMETADATAFILE=samplemetadata.tsv
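# The awk program is in single quotes so that the shell does not expand $1, $2, $3
SAMPLE=$(awk -v n="$SLURM_ARRAY_TASK_ID" 'NR==n {print $1}' "$SAMPLEMETADATAFILE")
R1=$(awk -v n="$SLURM_ARRAY_TASK_ID" 'NR==n {print $2}' "$SAMPLEMETADATAFILE")
R2=$(awk -v n="$SLURM_ARRAY_TASK_ID" 'NR==n {print $3}' "$SAMPLEMETADATAFILE")
mkdir -p $SAMPLE
STAR --genomeDir $STARINDEX --runThreadN $SLURM_CPUS_PER_TASK --readFilesCommand zcat --readFilesIn ${DIRfastq}/${R1} ${DIRfastq}/${R2} --sjdbGTFfile $GTF --outFileNamePrefix ${SAMPLE}/${SAMPLE}_ --outSAMtype BAM SortedByCoordinate
samtools index ${SAMPLE}/${SAMPLE}_Aligned.sortedByCoord.out.bam
Explanation:
awk -v n="$SLURM_ARRAY_TASK_ID" 'NR==n {print $2}' samplemetadata.tsv
awk returns the 2nd column of row number SLURM_ARRAY_TASK_ID. The task ID is passed in with -v, and the awk program is written in single quotes: inside double quotes the shell would replace $2 with its own (empty) positional parameter, awk would receive {print }, and the whole row would be returned instead of one column.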
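The row-and-column lookup can be checked on a small file before submitting the real job. This is a sketch under assumptions: the two-row metadata file is written to a temporary location, the fallback task ID of 2 is arbitrary, and the columns are assumed to be tab-separated with no header line. It passes the task ID to awk with -v and keeps the awk program in single quotes.

```shell
#!/bin/bash
# Sketch: pick one row of a tab-separated metadata file by task ID.
# The file content and the fallback task ID are made up for illustration.
metadata=$(mktemp)
printf 'wt1\twt1_R1.fastq.gz\twt1_R2.fastq.gz\twt\n' >  "$metadata"
printf 'wt2\twt2_R1.fastq.gz\twt2_R2.fastq.gz\twt\n' >> "$metadata"

TASK_ID=${SLURM_ARRAY_TASK_ID:-2}

# -v passes the shell value into awk; the single quotes keep $1, $2, $3
# as awk column references instead of shell positional parameters.
SAMPLE=$(awk -F'\t' -v n="$TASK_ID" 'NR==n {print $1}' "$metadata")
R1=$(awk -F'\t' -v n="$TASK_ID" 'NR==n {print $2}' "$metadata")
R2=$(awk -F'\t' -v n="$TASK_ID" 'NR==n {print $3}' "$metadata")

# Stop early if the row does not exist (array range longer than the file).
if [ -z "$SAMPLE" ]; then
    echo "No row $TASK_ID in $metadata" >&2
    exit 1
fi

echo "sample=$SAMPLE R1=$R1 R2=$R2"
rm "$metadata"
```

Since rows are counted from 1, the --array range should run from 1 to the number of lines in the metadata file.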
Strategy #2: without a sample metadata file
#!/bin/bash
#
#SBATCH --mem 70GB
#SBATCH --cpus-per-task 30
#SBATCH --array=1-16
module load star/2.7.5a
module load samtools/1.10
# INPUTS PATH
DIRfastq="../fastq/"
GENOME="/shared/bank/homo_sapiens/GRCh38/"
STARINDEX="$GENOME/star-2.7.5a/"
GTF="$GENOME/gtf/Homo_sapiens.GRCh38.101.gtf"
INPUT=$(ls $DIRfastq/*_R1.fastq.gz | awk "NR==$SLURM_ARRAY_TASK_ID")
SAMPLE=$(basename $INPUT _R1.fastq.gz)
mkdir -p $SAMPLE
STAR --genomeDir $STARINDEX --runThreadN $SLURM_CPUS_PER_TASK --readFilesCommand zcat --readFilesIn ${DIRfastq}/${SAMPLE}_R1.fastq.gz ${DIRfastq}/${SAMPLE}_R2.fastq.gz --sjdbGTFfile $GTF --outFileNamePrefix ${SAMPLE}/${SAMPLE}_ --outSAMtype BAM SortedByCoordinate
samtools index ${SAMPLE}/${SAMPLE}_Aligned.sortedByCoord.out.bam
Explanation:
The key lines are:
INPUT=$(ls $DIRfastq/*_R1.fastq.gz | awk "NR==$SLURM_ARRAY_TASK_ID")
SAMPLE=$(basename $INPUT _R1.fastq.gz)
Together they produce ONE sample name per task, without the R1/R2 suffix:
- 1: ls $DIRfastq/*_R1.fastq.gz lists all the R1 files, so one file per sample
- 2: awk "NR==$SLURM_ARRAY_TASK_ID" keeps the Nth line of the list generated by ls
- 3: basename $INPUT _R1.fastq.gz removes the directory path and the _R1.fastq.gz suffix
In practice:
$ ls ../fastq/
sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz
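- Round 1: $SLURM_ARRAY_TASK_ID == 1, $INPUT == "../fastq//sample1_R1.fastq.gz", $SAMPLE == "sample1"
- Round 2: $SLURM_ARRAY_TASK_ID == 2, $INPUT == "../fastq//sample2_R1.fastq.gz", $SAMPLE == "sample2"
$INPUT keeps the path exactly as it was passed to ls (the double slash comes from the trailing / in DIRfastq and is harmless); basename then strips it.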
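The two rounds above can be reproduced without SLURM. The sketch below is an illustration only: it creates empty placeholder files in a temporary directory and loops over the task IDs 1 and 2 by hand, where SLURM would instead start one task per ID.

```shell
#!/bin/bash
# Sketch: derive one sample name per task from the R1 files alone.
# Directory and file names are placeholders created for the demonstration.
DIRfastq=$(mktemp -d)
touch "$DIRfastq"/sample1_R1.fastq.gz "$DIRfastq"/sample1_R2.fastq.gz \
      "$DIRfastq"/sample2_R1.fastq.gz "$DIRfastq"/sample2_R2.fastq.gz

# Loop over the task IDs that SLURM would hand out one by one.
for TASK_ID in 1 2; do
    # Nth R1 file of the sorted listing, then strip path and suffix.
    INPUT=$(ls "$DIRfastq"/*_R1.fastq.gz | awk -v n="$TASK_ID" 'NR==n')
    SAMPLE=$(basename "$INPUT" _R1.fastq.gz)
    echo "task $TASK_ID: SAMPLE=$SAMPLE R2=${SAMPLE}_R2.fastq.gz"
done
rm -r "$DIRfastq"
```

This approach relies on every sample having exactly one file ending in _R1.fastq.gz and a matching _R2.fastq.gz; other naming schemes would need a different suffix.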