
I have filenames like the following:

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_100_S1_L001_R2_001.fastq.gz
fastqs/hgmm_100_S1_L002_R2_001.fastq.gz
fastqs/hgmm_100_S1_L003_R2_001.fastq.gz

I want to merge each of the groups shown above into a single file, combining the files that differ only in the LXXX part.

I can do it like the following:

cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > data/hgmm_100_S1_R1_001.fastq.gz
cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > data/hgmm_100_S1_R2_001.fastq.gz

But this requires me to hard-code each file group. How can I set it up so that it merges all of the L values into one group and writes an output file with the same name as the input files, just without the L part?

Thanks, Jack

EDIT:

Sorry for not including this in the original post, but what if I had something like:

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_200_S1_L001_R2_001.fastq.gz
fastqs/hgmm_200_S1_L002_R2_001.fastq.gz
fastqs/hgmm_200_S1_L003_R2_001.fastq.gz

(Only change is the very beginning (100 -> 200))

How would this work? Essentially I want to merge these files as long as all parts of the name except for L??? are identical.

Jack Arnestad
  • @Socowi My bad! I will fix this. – Jack Arnestad Oct 21 '18 at 13:41
  • You have to hardcode *something* and it's not clear what the boundaries are. Would it be fair to say that you want the outer loop to iterate over unique sequences of `L` and three digits? – tripleee Oct 21 '18 at 13:44
  • @tripleee Yes, the L+3 digits is allowed to be different across filenames, but the rest of the filename must be identical for merging criteria. – Jack Arnestad Oct 21 '18 at 13:55
  • It's customary, when you're asking for help with your code, to include your code. What have you tried so far? What were your results? – ghoti Oct 21 '18 at 22:19

2 Answers


If the pattern _L###_ exists only in that one part of the filename, you might try something like this:

#!/usr/bin/env bash

# Define an associative array. Requires bash 4+
declare -A a

# Use extended glob notation (see "Pattern Matching" in the bash man page).
shopt -s extglob

# Collect the file patterns by writing indexes in the array.
for f in fastqs/*_L+([0-9])_*.fastq.gz; do
  a["${f/_L+([0-9])_/_*_}"]=1
done

# And finally, gather your files.
for f in "${!a[@]}"; do
  # Strip any existing directory part of the filename to build our target
  target="data/${f##*/}"
  # Concatenate files matching the glob into our intended target.
  # $f is deliberately left unquoted so that the shell expands the glob.
  cat $f > "${target/[*]_/}"
done
  • We use Pattern Substitution to convert the variable part of each filespec into a glob.
  • We use the index of an associative array because it makes it easy to keep a unique list.
  • ${!a[@]} lets us step through an array's indices rather than its values.
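
To see what the two substitutions do to a single name, here is a minimal sketch that touches no files. The variable names `glob` and `target` are only for illustration, and the target is built by removing the `_L###_` part directly instead of going through the glob form as the script above does.

```shell
#!/usr/bin/env bash
# Extended globs are needed for the +([0-9]) pattern.
shopt -s extglob

f="fastqs/hgmm_100_S1_L001_R1_001.fastq.gz"

# Turn the variable part of the name into a glob.
glob="${f/_L+([0-9])_/_*_}"

# Strip the directory, prepend the new one, then drop the _L###_ part.
target="data/${f##*/}"
target="${target/_L+([0-9])_/_}"

printf '%s\n' "$glob" "$target"
# fastqs/hgmm_100_S1_*_R1_001.fastq.gz
# data/hgmm_100_S1_R1_001.fastq.gz
```

Every file of one group yields the same `glob` string, which is why using it as the array index leaves exactly one entry per group.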
ghoti
  • How do I write the out files to a different directory? /data in the example. Thanks! – Jack Arnestad Oct 21 '18 at 16:16
  • You didn't include any code that would attempt to put files into a different directory. But you can strip off the existing directory portion of `$f` using Parameter Expansion, then prepend the new target directory. It requires an additional line for the variable assignment of course, since you can only do one Parameter Expansion at a time. – ghoti Oct 21 '18 at 22:17

You can do the grouping on the fly. Iterate over all files and append each one to its group's output file. Globs such as * and ? expand in sorted order, so the files within each group are appended in the correct order.

cd fastqs
for f in *_L???_*fastq.gz; do
    cat "$f" >> "../data/${f/_L???_/_}"
done
cd ..

Since files are always appended, you should clear your data/ directory before running this command again.
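
As a rough end-to-end check, the loop can be tried in a scratch directory. This is only a sketch: the `work` directory and the placeholder files (plain text, not real gzip data) are made up for the demonstration, and the `rm -f` line is one possible way to clear old outputs. It also covers the case where the start of the name differs (hgmm_100 vs. hgmm_200), since each file is appended to whatever name remains after `_L???_` is removed.

```shell
#!/usr/bin/env bash
# Hypothetical scratch layout, used only for this demonstration.
work=$(mktemp -d)
mkdir "$work/fastqs" "$work/data"

# Placeholder files (plain text, not real gzip data) for two samples.
for sample in hgmm_100 hgmm_200; do
  for lane in L001 L002 L003; do
    printf '%s %s\n' "$sample" "$lane" \
      > "$work/fastqs/${sample}_S1_${lane}_R1_001.fastq.gz"
  done
done

cd "$work/fastqs" || exit 1

# Remove outputs of any earlier run, since the loop below only appends.
rm -f ../data/*.fastq.gz

# Append every file to the output named after it, minus the _L???_ part.
for f in *_L???_*.fastq.gz; do
  cat "$f" >> "../data/${f/_L???_/_}"
done
cd ..

ls data
```

With the placeholder input above, `data/` should end up with one output per sample, each holding its three inputs in L001, L002, L003 order.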

Socowi
  • Thanks so much for this! There is one case I overlooked when writing the question, and that is that the beginning of the name is not necessarily consistent. Could you suggest how to deal with the case I added to the edit above? Thanks again :) – Jack Arnestad Oct 21 '18 at 14:14
  • Then you are best off with the last of the three alternatives. You simply have to adjust the glob pattern so that it matches all the files you want to include. See edit. – Socowi Oct 21 '18 at 15:18