Deal with FASTQ Files

How to launch the same software on all my FASTQ files?

We will launch FastQC with 2 CPUs on each of the 30 FASTQ files, one array task per file:

#!/bin/bash
#
#SBATCH --array 0-29  # 30 jobs
#SBATCH --cpus-per-task 2

module load fastqc/0.11.9

INPUTS=(../fastq/*fastq.gz)

fastqc -f fastq -t "$SLURM_CPUS_PER_TASK" -o . "${INPUTS[$SLURM_ARRAY_TASK_ID]}"
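
The array range has to match the number of files: bash arrays are indexed from 0, and each task picks the file at index SLURM_ARRAY_TASK_ID. Below is a minimal sketch to preview that mapping without submitting anything. The temporary directory and the s1 to s3 file names are made up for the demo, and SLURM_ARRAY_TASK_ID is set by hand here instead of by Slurm.

```shell
#!/bin/bash
# Hypothetical demo directory: three empty files stand in for real FASTQ files.
demo_dir=$(mktemp -d)
touch "$demo_dir"/s1.fastq.gz "$demo_dir"/s2.fastq.gz "$demo_dir"/s3.fastq.gz

# Same glob-into-array idiom as the job script; the glob is expanded in sorted order.
INPUTS=("$demo_dir"/*fastq.gz)
n_files=${#INPUTS[@]}
last_index=$((n_files - 1))
echo "Found $n_files files: use --array 0-$last_index"

# Replay what each array task would select (Slurm sets this variable in a real job).
for SLURM_ARRAY_TASK_ID in $(seq 0 "$last_index"); do
    echo "task $SLURM_ARRAY_TASK_ID -> $(basename "${INPUTS[$SLURM_ARRAY_TASK_ID]}")"
done

rm -r "$demo_dir"
```

Pointing the glob at ../fastq/ instead of the demo directory gives the range to use for the real files.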

How to launch the same software on my paired-end FASTQ files?

Strategy #1: with a sample metadata file

A sample metadata file that describes your samples, files and conditions is useful, especially for downstream statistics.

samplemetadata.tsv

wt1  wt1_R1.fastq.gz wt1_R2.fastq.gz wt
wt2  wt2_R1.fastq.gz wt2_R2.fastq.gz wt
treat1 treat1_R1.fastq.gz treat1_R2.fastq.gz treat
treat2 treat2_R1.fastq.gz treat2_R2.fastq.gz treat

#!/bin/bash
#
#SBATCH --mem 70GB
#SBATCH --cpus-per-task 30
#SBATCH --array=1-16  # one job per row of the sample metadata file

module load star/2.7.5a
module load samtools/1.10

# INPUTS PATH
DIRfastq="../fastq/"
GENOME="/shared/bank/homo_sapiens/GRCh38/"
STARINDEX="$GENOME/star-2.7.5a/"
GTF="$GENOME/gtf/Homo_sapiens.GRCh38.101.gtf"

SAMPLEMETADATAFILE=samplemetadata.tsv

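# The awk programs are single-quoted so the shell does not expand $1, $2, $3;
# the task ID is passed to awk with -v.
SAMPLE=$(awk -v row="$SLURM_ARRAY_TASK_ID" 'NR == row {print $1}' "$SAMPLEMETADATAFILE")
R1=$(awk -v row="$SLURM_ARRAY_TASK_ID" 'NR == row {print $2}' "$SAMPLEMETADATAFILE")
R2=$(awk -v row="$SLURM_ARRAY_TASK_ID" 'NR == row {print $3}' "$SAMPLEMETADATAFILE")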

mkdir -p "$SAMPLE"
STAR --genomeDir "$STARINDEX" \
     --runThreadN "$SLURM_CPUS_PER_TASK" \
     --readFilesCommand zcat \
     --readFilesIn "${DIRfastq}/${R1}" "${DIRfastq}/${R2}" \
     --sjdbGTFfile "$GTF" \
     --outFileNamePrefix "${SAMPLE}/${SAMPLE}_" \
     --outSAMtype BAM SortedByCoordinate
samtools index "${SAMPLE}/${SAMPLE}_Aligned.sortedByCoord.out.bam"

Explanation:

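awk -v row="$SLURM_ARRAY_TASK_ID" 'NR == row {print $2}' samplemetadata.tsv

awk returns the 2nd column of row number SLURM_ARRAY_TASK_ID. The awk program is single-quoted so that the shell does not expand $2 before awk sees it; the task ID is passed in with -v.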
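
Shell quoting decides what awk actually receives in this kind of command. Below is a minimal sketch of the difference, run on a throwaway two-row copy of the metadata. The temporary file and the hard-coded task ID are assumptions for the demo; in a real job Slurm sets SLURM_ARRAY_TASK_ID.

```shell
#!/bin/bash
# Throwaway metadata file with the first two rows of the example above.
meta=$(mktemp)
printf 'wt1\twt1_R1.fastq.gz\twt1_R2.fastq.gz\twt\n'  > "$meta"
printf 'wt2\twt2_R1.fastq.gz\twt2_R2.fastq.gz\twt\n' >> "$meta"

SLURM_ARRAY_TASK_ID=2   # hard-coded for the demo
set --                  # clear the script's own positional parameters ($1, $2, ...)

# Single-quoted program: awk itself interprets $2 as "second field".
R1=$(awk -v row="$SLURM_ARRAY_TASK_ID" 'NR == row {print $2}' "$meta")
echo "single quotes: $R1"

# Double-quoted program: the shell replaces $2 with its own (empty) second
# argument first, so awk runs "NR==2 {print }" and prints the whole row.
whole_row=$(awk "NR==$SLURM_ARRAY_TASK_ID {print $2}" "$meta")
echo "double quotes: $whole_row"

rm -f "$meta"
```

The single-quoted form with -v keeps the column selection in awk's hands, which is why it is the safer way to read one field per task.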

Strategy #2: without a sample metadata file

#!/bin/bash
#
#SBATCH --mem 70GB
#SBATCH --cpus-per-task 30
#SBATCH --array=1-16  # one job per R1 file, i.e. per sample

module load star/2.7.5a
module load samtools/1.10

# INPUTS PATH
DIRfastq="../fastq/"
GENOME="/shared/bank/homo_sapiens/GRCh38/"
STARINDEX="$GENOME/star-2.7.5a/"
GTF="$GENOME/gtf/Homo_sapiens.GRCh38.101.gtf"


INPUT=$(ls $DIRfastq/*_R1.fastq.gz | awk "NR==$SLURM_ARRAY_TASK_ID")
SAMPLE=$(basename $INPUT _R1.fastq.gz)

mkdir -p "$SAMPLE"
STAR --genomeDir "$STARINDEX" \
     --runThreadN "$SLURM_CPUS_PER_TASK" \
     --readFilesCommand zcat \
     --readFilesIn "${DIRfastq}/${SAMPLE}_R1.fastq.gz" "${DIRfastq}/${SAMPLE}_R2.fastq.gz" \
     --sjdbGTFfile "$GTF" \
     --outFileNamePrefix "${SAMPLE}/${SAMPLE}_" \
     --outSAMtype BAM SortedByCoordinate
samtools index "${SAMPLE}/${SAMPLE}_Aligned.sortedByCoord.out.bam"

Explanation:

Here is the magic:

INPUT=$(ls $DIRfastq/*_R1.fastq.gz | awk "NR==$SLURM_ARRAY_TASK_ID")
SAMPLE=$(basename $INPUT _R1.fastq.gz)

The magic to get ONE sample name, without the R1/R2 suffix:
  1. ls $DIRfastq/*_R1.fastq.gz lists all the R1 files, so one file per sample
  2. awk "NR==$SLURM_ARRAY_TASK_ID" keeps the Nth file of the list generated by ls
  3. basename $INPUT _R1.fastq.gz removes the directory path and the "extension"

In practice:

$ ls ../fastq/
sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz
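
  • Round 1:
      • $SLURM_ARRAY_TASK_ID == 1
      • $INPUT == "../fastq//sample1_R1.fastq.gz" (ls keeps the directory part; the doubled slash comes from the trailing / in DIRfastq and is harmless)
      • $SAMPLE == "sample1"
  • Round 2:
      • $SLURM_ARRAY_TASK_ID == 2
      • $INPUT == "../fastq//sample2_R1.fastq.gz"
      • $SAMPLE == "sample2"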
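
The two rounds can be replayed locally before submitting the array. This is a sketch under assumptions: a temporary directory holds empty files named like the listing above, and a plain loop sets SLURM_ARRAY_TASK_ID in place of Slurm. The SAMPLES array only exists to collect the results of the demo.

```shell
#!/bin/bash
# Throwaway directory with empty files named like the example listing.
DIRfastq=$(mktemp -d)
touch "$DIRfastq"/sample1_R1.fastq.gz "$DIRfastq"/sample1_R2.fastq.gz \
      "$DIRfastq"/sample2_R1.fastq.gz "$DIRfastq"/sample2_R2.fastq.gz

SAMPLES=()
for SLURM_ARRAY_TASK_ID in 1 2; do   # Slurm sets this variable in a real job
    # Steps 1 and 2: list the R1 files and keep the Nth one.
    INPUT=$(ls "$DIRfastq"/*_R1.fastq.gz | awk -v n="$SLURM_ARRAY_TASK_ID" 'NR == n')
    # Step 3: strip the directory and the _R1.fastq.gz suffix.
    SAMPLE=$(basename "$INPUT" _R1.fastq.gz)
    echo "task $SLURM_ARRAY_TASK_ID: INPUT=$INPUT SAMPLE=$SAMPLE"
    SAMPLES+=("$SAMPLE")
done

rm -r "$DIRfastq"
```

Running the same loop over the real ../fastq/ directory, with the loop bounds matching the --array range, is a cheap way to check that every task ID resolves to a sample name.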