I have received fastq.gz files from Illumina sequencing for 100 samples. The files for each sample are in a separate folder named after the sample ID, and each sample has multiple (8-16) R1.fastq.gz and R2.fastq.gz files. For one sample, I used the following command to concatenate the R1 files into a single R1 file, and did the same for the R2 files:

cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz

The sequencing file names are structured as in the command above: a string starting with V followed by a number, then L followed by a number, then another string of digits before the _1 or _2. These numbers change from sample to sample. My question is: how can I write a loop that goes over all the folders at once, takes the varying file names into account, and concatenates the multiple fq.gz files of each sample into a single R1 file and a single R2 file? I cannot realistically go into each sample folder and concatenate the files one by one.

Please give some helpful tips. Thank you.
The folder structure is the following:

/data/Sample_1/....._525_1_fq.gz    /....._525_2_fq.gz    /....._526_1_fq.gz        /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz    /....._580_2_fq.gz    /....._589_1_fq.gz        /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz    /....._690_2_fq.gz    /....._645_1_fq.gz        /....._645_2_fq.gz

Below I have attached a screenshot of the folder structure.

[Screenshot: folder structure]

Anik Dutta
  • I suppose you want to concatenate files which share the same sample ID. Which portion of the filename indicates the sample ID? I'm afraid your description and example filenames do not provide enough information which files should be merged and which should be kept separated. BR. – tshiono Feb 23 '22 at 23:46
  • It would be easier if you put a screenshot or an output from tree command. Hard to follow for me. – Supertech Feb 24 '22 at 02:17
  • Hello @tshiono @Supertech I have added the structure of the folder to the post. Please have a look. The ID starting with `C` is the sample ID. – Anik Dutta Feb 24 '22 at 07:36
  • Thank you for providing the file structure. Taking `C077` directory for example, I assume you want to concatenate `V350028825_L04_*_1.fq.gz`'s into `sample_R1.fq.gz` and `V350028825_L04_*_2.fq.gz`'s into `sample_R2.fq.gz`. Am I right? – tshiono Feb 24 '22 at 07:50
  • Yes, I want to combine all the `_1.fq.gz` files into `sample_R1.fq.gz` and all the `_2.fq.gz` files into `sample_R2.fq.gz`. – Anik Dutta Feb 24 '22 at 08:13

1 Answer

Based on the provided file structure, would you please try:

#!/bin/bash

for d in Raw2/C*/; do
(
    cd "$d" || exit 1                   # leave the subshell if cd fails, so cat never runs in the wrong directory
    id=${d%/}; id=${id##*/}             # extract ID from the directory name
    cat V*_1.fq.gz > "${id}_R1.fq.gz"
    cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done
  • The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
  • The parentheses run the inner commands in a subshell, so we don't have to change back to the original directory after cd "$d" (at the cost of a little extra execution time).
  • The variable id is assigned the ID extracted from the directory name.
  • cat V*_1.fq.gz, for example, expands to cat V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these files are concatenated into ${id}_R1.fq.gz. The same applies to ${id}_R2.fq.gz.
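If you would like to check the result as you go, the same steps can be wrapped in two small functions. This is only a sketch under assumptions: the function names dir_to_id and merge_sample are made up for illustration, and the gzip -t integrity check is an addition that is not part of the loop above. It relies on the fact that concatenated gzip files form a valid multi-member gzip file.

```shell
#!/bin/bash

# Hypothetical helper: turn a directory path such as "Raw2/C077/" into "C077".
dir_to_id() {
    local d=$1
    d=${d%/}                    # drop one trailing slash: "Raw2/C077/" -> "Raw2/C077"
    printf '%s\n' "${d##*/}"    # drop everything up to the last slash: -> "C077"
}

# Hypothetical helper: merge the _1 and _2 files of one sample directory,
# then test the merged archives. Returns non-zero if any step fails.
merge_sample() {
    local dir=$1 id
    id=$(dir_to_id "$dir")
    (
        cd "$dir" || exit 1
        cat V*_1.fq.gz > "${id}_R1.fq.gz" &&
        cat V*_2.fq.gz > "${id}_R2.fq.gz" &&
        gzip -t "${id}_R1.fq.gz" "${id}_R2.fq.gz"   # integrity check (an addition)
    )
}

# Possible usage, assuming the Raw2/C*/ layout from the question:
# for d in Raw2/C*/; do
#     merge_sample "$d" || echo "failed: $d" >&2
# done
```

The output names start with the sample ID (C...), so they do not match the V* glob and rerunning the loop will not feed a merged file back into itself.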
tshiono