How to loop over multiple folders to concatenate FastQ files?

Question

I have received multiple fastq.gz files from Illumina sequencing for 100 samples. The fastq.gz files for each sample are in a separate folder named after the sample ID, and each sample has multiple (8-16) R1.fastq.gz and R2.fastq.gz files. I used the following command to concatenate all the R1.fastq.gz and R2.fastq.gz files of a sample into a single R1.fastq.gz and a single R2.fastq.gz:

cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz

The sequencing file names are structured as in the command above. For each sample, the string starting with V is followed by a number, then L with another number, and then another string of digits before the _1 or _2. These numbers change from sample to sample. My question is: how can I create a loop that goes over all the folders at once, takes the varying file numbering into account, and combines the multiple fq.gz files of each sample into a single R1 file and a single R2 file? Surely I cannot concatenate them one by one by going into each sample folder.

Please give some helpful tips. Thank you.
The folder structure is the following:

/data/Sample_1/....._525_1_fq.gz    /....._525_2_fq.gz    /....._526_1_fq.gz        /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz    /....._580_2_fq.gz    /....._589_1_fq.gz        /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz    /....._690_2_fq.gz    /....._645_1_fq.gz        /....._645_2_fq.gz

Below I have attached a screenshot of the folder structure.

[Screenshot: folder structure]

I suppose you want to concatenate files which share the same sample ID. Which portion of the filename indicates the sample ID? I'm afraid your description and example filenames do not provide enough information which files should be merged and which should be kept separated. BR. — tshiono, Feb 23 '22 at 23:46
It would be easier if you put a screenshot or an output from tree command. Hard to follow for me. — Supertech, Feb 24 '22 at 02:17
Hello @tshiono @Supertech I have added the structure of the folder to the post. Please have a look. The ID starting with `C` is the sample ID. — Anik Dutta, Feb 24 '22 at 07:36
Thank you for providing the file structure. Taking `C077` directory for example, I assume you want to concatenate `V350028825_L04_*_1.fq.gz`'s into `sample_R1.fq.gz` and `V350028825_L04_*_2.fq.gz`'s into `sample_R2.fq.gz`. Am I right? — tshiono, Feb 24 '22 at 07:50
Yes to combine every `_1.fq.gz` and `_2.fq.gz` files into `sample_R1.fq.gz` and `sample_R2.fq.gz` — Anik Dutta, Feb 24 '22 at 08:13

tshiono · Accepted Answer · 2022-02-24T09:21:23.347

Based on the provided file structure, would you please try:

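#!/bin/bash

for d in Raw2/C*/; do                   # each sample directory, e.g. Raw2/C077/
(
    cd "$d" || exit                     # leave this subshell if cd fails; the loop continues
    id=${d%/}; id=${id##*/}             # extract ID from the directory name
    cat V*_1.fq.gz > "${id}_R1.fq.gz"   # merge every _1 (R1) file of this sample
    cat V*_2.fq.gz > "${id}_R2.fq.gz"   # merge every _2 (R2) file of this sample
)
done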

The syntax `for d in Raw2/C*/` loops over the subdirectories whose names start with C.
The parentheses run the inner commands in a subshell, so there is no need to change back after `cd "$d"` (at the cost of a small amount of extra execution time).
The variable `id` is assigned the ID extracted from the directory name.
`cat V*_1.fq.gz`, for example, expands to `V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ...` according to the files in the directory, and those files are concatenated into `${id}_R1.fq.gz`. The same applies to `${id}_R2.fq.gz`.
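To try the loop without touching real data, a dry run along the following lines may help. This is only a sketch under stated assumptions: it builds a scratch copy of the `Raw2/C*/` layout in a temporary directory, and the file names (`V1_L04_581_1.fq.gz` and so on) and the one-record FASTQ contents are made up for illustration. It then runs the same loop and checks that a merged file passes `gzip -t` and contains the records of both inputs, which relies on gzip accepting concatenated compressed streams.

```shell
#!/bin/bash
# Scratch copy of the layout; every file name and read below is made up for illustration.
tmp=$(mktemp -d) || exit 1
for s in C077 C078; do
    mkdir -p "$tmp/Raw2/$s"
    for n in 581 582; do
        # one tiny FASTQ record (4 lines) per file, gzip-compressed
        printf '@%s_%s/1\nACGT\n+\nIIII\n' "$s" "$n" | gzip > "$tmp/Raw2/$s/V1_L04_${n}_1.fq.gz"
        printf '@%s_%s/2\nTGCA\n+\nIIII\n' "$s" "$n" | gzip > "$tmp/Raw2/$s/V1_L04_${n}_2.fq.gz"
    done
done

cd "$tmp" || exit 1
for d in Raw2/C*/; do
(
    cd "$d" || exit
    id=${d%/}       # drop the trailing slash:          Raw2/C077/ -> Raw2/C077
    id=${id##*/}    # drop everything up to the last /: Raw2/C077  -> C077
    cat V*_1.fq.gz > "${id}_R1.fq.gz"
    cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done

# The merged file should pass gzip's integrity test and hold both records.
r1_ok=no
gzip -t Raw2/C077/C077_R1.fq.gz && r1_ok=yes
r1_lines=$(( $(gzip -dc Raw2/C077/C077_R1.fq.gz | wc -l) ))
echo "C077 R1: integrity=$r1_ok lines=$r1_lines"
```

With two 4-line inputs per read direction, the merged R1 file should report 8 lines. On real data, the same `gzip -t` check can be run on each `${id}_R1.fq.gz` and `${id}_R2.fq.gz` before the original files are removed.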

edited Feb 24 '22 at 09:21

answered Feb 24 '22 at 08:19

tshiono

Thank you very much. I will try this ASAP and let you know. – Anik Dutta Feb 24 '22 at 08:23
I have one question. How does this code output the `fq.gz` files with the sample ID which starts with `C`? `cat V*_1.fq.gz > sample_R1.fq.gz` – Anik Dutta Feb 24 '22 at 08:58
Do you want to prepend the ID to the concatenated filename such as `C077_R1.fq.gz`? – tshiono Feb 24 '22 at 09:04
Yes exactly. `C077` and so on are the sample IDs by which the folders are named. – Anik Dutta Feb 24 '22 at 09:05
Thank you for the feedback. I've updated my answer accordingly. – tshiono Feb 24 '22 at 09:21
Thank you for the modification. I will let you know later if the code works or not. – Anik Dutta Feb 24 '22 at 10:23
Hi @tshiono the code worked like a charm! Many many thanks to you for saving a lot of my time. – Anik Dutta Feb 25 '22 at 15:22
