Within a certain directory I have many directories, each containing a bunch of text files. I’m trying to write a script that concatenates the files in each directory that have the string ‘R1’ in their filename into one file within that directory, and those that have ‘R2’ into another. This is what I wrote, but it’s not working.

#!/bin/bash

for f in */*.fastq; do

    if grep 'R1' $f ; then
        cat "$f" >> R1.fastq
    fi

    if grep 'R2' $f ; then
        cat "$f" >> R2.fastq
    fi

done

I get no errors and the files are created as intended but they are empty files. Can anyone tell me what I’m doing wrong?

Thank you all for the fast and detailed responses! I think I wasn't very clear in my question: I need the script to concatenate only the files within each specific directory, so that each directory gets its own new files (R1 and R2). I tried doing

cat /*R1*.fastq >*/R1.fastq 

but it gave me an ambiguous redirect error. I also tried Charles Duffy's for loop, looping through the directories with a nested loop to run through each file within a directory, like so

for f in */; do
   for d in "$f"/*.fastq;do
     case "$d" in
       *R1*) cat "$d" >&3
       *R2*) cat "$d" >&4
     esac
   done 3>R1.fastq 4>R2.fastq
done

but it was giving an unexpected token error regarding ')'.

Sorry in advance if I'm missing something elementary, I'm still very new to bash.

Alon Gelber
  • `grep 'R1' $f` doesn't look for `R1` in the *name* of `$f`; it looks for `R1` in the *contents* of whichever set of filenames `$f` generates if split on characters in IFS (by default, newlines, tabs and spaces) after each piece is expanded as a glob. If you wanted to reliably look in the *contents* of the file named in the variable `$f`, then you'd need quotes, `grep R1 "$f"`; if you want to look at its name... well, the answers cover that. – Charles Duffy Jan 12 '17 at 22:28
  • Frankly, I'd argue that your updates are expansive enough to change the meaning of the question. Changing the meaning is fine before you have answers, but if your question already has answers that only make sense in the context of the original question, then you should ask a new question instead of rewriting. – Charles Duffy Jan 13 '17 at 18:00

3 Answers


A Note To The Reader

Please review edit history on the question in considering this answer; several parts have been made less relevant by question edits.

One cat Per Output File

For the purpose at hand, you can probably just let shell globbing do all the work (if R1 or R2 will be in the filenames, as opposed to the directory names):

set -x # log what's happening!
cat */*R1*.fastq >R1.fastq
cat */*R2*.fastq >R2.fastq

One find Per Output File

If it's a really large number of files, by contrast, you might need find:

find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' -exec cat '{}' + >R1.fastq
find . -mindepth 2 -maxdepth 2 -type f -name '*R2*.fastq' -exec cat '{}' + >R2.fastq

This is because of the OS-dependent limit on command-line length. The find command given above puts as many arguments onto each cat invocation as possible for efficiency, but still splits them across multiple invocations where the limit would otherwise be exceeded.
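To see the limit and the batching for yourself, here is a small sketch. It assumes `getconf` and `mktemp -d` are available, and the scratch directory and file names are made up for illustration. The inner `sh -c 'echo "$#"'` stands in for `cat` and simply reports how many file names find handed to each invocation.

```shell
#!/bin/bash
# Upper bound (in bytes) on the arguments plus environment passed to a new process.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"

# Scratch tree with hypothetical file names.
demo=$(mktemp -d)
mkdir "$demo/s1" "$demo/s2"
touch "$demo/s1/a_R1.fastq" "$demo/s1/b_R1.fastq" "$demo/s2/c_R1.fastq"

# Each output line is the argument count of one batch built by -exec ... +.
batches=$(find "$demo" -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' \
  -exec sh -c 'echo "$#"' sh '{}' +)
echo "files per invocation: $batches"
```

With only a handful of files everything fits into a single batch; multiple lines would appear only once the combined length of the names approaches the limit.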


Iterate-And-Test

If you really do want to iterate over everything, and then test the names, consider a case statement for the job, which is much more efficient than using grep to check just one line:

for f in */*.fastq; do
  case $f in
    *R1*) cat "$f" >&3 ;;  # each case branch must be terminated with ;;
    *R2*) cat "$f" >&4 ;;
  esac
done 3>R1.fastq 4>R2.fastq

Note the use of file descriptors 3 and 4 to write to R1.fastq and R2.fastq respectively -- that way we're only opening the output files once (and thus truncating them exactly once) when the for loop starts, and reusing those file descriptors rather than re-opening the output files at the beginning of each cat. (That said, running cat once per file -- which find -exec {} + avoids -- is probably more overhead on balance).
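The difference between redirecting inside the loop and redirecting on the loop itself can be seen in a small sketch. The scratch directory and the file names `a.txt`, `b.txt`, `truncated.out` and `once.out` are made up for illustration.

```shell
#!/bin/bash
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'one\n' > a.txt
printf 'two\n' > b.txt

# A plain > inside the loop reopens and truncates the output on every pass,
# so only the contents of the last input survive.
for f in a.txt b.txt; do
  cat "$f" > truncated.out
done

# Redirecting on the loop opens the output once; every cat writes to the
# same already-open file descriptor 3.
for f in a.txt b.txt; do
  cat "$f" >&3
done 3> once.out

echo "truncated.out:"; cat truncated.out
echo "once.out:"; cat once.out
```

Using `>>` inside the loop would also keep every input, but it never truncates, so output left over from an earlier run would still be in the file.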


Operating Per-Directory

All of the above can be updated to work on a per-directory basis quite trivially. For example:

for d in */; do
  find "$d" -name R1.fastq -prune -o -name '*R1*.fastq' -exec cat '{}' + >"$d/R1.fastq"
  find "$d" -name R2.fastq -prune -o -name '*R2*.fastq' -exec cat '{}' + >"$d/R2.fastq"
done

There are only two significant changes:

  • We're no longer specifying -mindepth, which was previously there to ensure that our input files only came from subdirectories.
  • We're excluding R1.fastq and R2.fastq from our input files, so we never try to use the same file as both input and output. This is a consequence of the prior change: Previously, our output files couldn't be considered as input because they didn't meet the minimum depth.
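The iterate-and-test loop can be adapted per directory in the same spirit. The following is a sketch only: the scratch tree and its file names are invented for the demonstration. It tests the file's base name (`${f##*/}`) rather than the whole path, skips the two output files so they are never read back in, and ends every `case` branch with `;;`, which the shell requires.

```shell
#!/bin/bash
# Scratch tree with hypothetical sample files.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir s1 s2
printf 'a\n' > s1/x_R1.fastq
printf 'b\n' > s1/y_R1.fastq
printf 'c\n' > s1/x_R2.fastq
printf 'd\n' > s2/z_R1.fastq
printf 'e\n' > s2/z_R2.fastq

for d in */; do
  # $d already ends in a slash, so "$d"*.fastq lists that directory's files.
  for f in "$d"*.fastq; do
    case ${f##*/} in
      R1.fastq|R2.fastq) ;;        # our own outputs: never use them as input
      *R1*) cat "$f" >&3 ;;
      *R2*) cat "$f" >&4 ;;
    esac
  done 3>"${d}R1.fastq" 4>"${d}R2.fastq"
done
```

Each directory ends up with its own R1.fastq and R2.fastq, opened once per directory by the redirections on the inner loop.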
Charles Duffy

Your grep is searching the file contents instead of file name. You could rewrite it this way:

for f in */*.fastq; do
  [[ -f $f ]] || continue
  if [[ $f = *R1* ]]; then
    cat "$f" >> R1.fastq
  elif [[ $f = *R2* ]]; then
    cat "$f" >> R2.fastq
  fi
done
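To make the contents-versus-name distinction concrete, here is a small sketch using a made-up file whose name contains R1 but whose contents do not. The name test is written with `case` so the sketch also runs under plain sh; `[[ $f = *R1* ]]` behaves the same way in bash.

```shell
#!/bin/bash
demo=$(mktemp -d)
f="$demo/sample_R1.fastq"                 # hypothetical file name
printf '@seq1\nACGT\n+\nIIII\n' > "$f"    # contents never mention R1

# grep reads what is inside the file, so this does not match.
if grep -q 'R1' "$f"; then in_contents=yes; else in_contents=no; fi

# A pattern match against the base name looks at the name itself.
case ${f##*/} in
  *R1*) in_name=yes ;;
  *)    in_name=no ;;
esac

echo "R1 in contents: $in_contents; R1 in name: $in_name"
```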
codeforester
  • You might also suggest either a `case` statement, or built-in functionality such as `if [[ $f = *R1* ]]; then ...`, either of which will be significantly faster / lower-overhead than spinning up a copy of `grep` just to read one line. – Charles Duffy Jan 12 '17 at 22:23
  • (Do you plan on taking that suggestion? If not, I'll edit it into my answer; hadn't yet decided to answer on my own when commenting above). – Charles Duffy Jan 12 '17 at 22:32
  • @CharlesDuffy: I modified it as per your excellent suggestion. – codeforester Jan 12 '17 at 22:45
  • Oh, shoot; I'd taken you for idle and already started my own version of that. That said, yours is cleaner and simpler, so I think there's value to having both. – Charles Duffy Jan 12 '17 at 22:46
  • I definitely like your suggestion. I was blinded by OP's code and didn't think clearly like you did. Once again, thanks for your inputs and it has been great learning from you Charles. – codeforester Jan 12 '17 at 22:51

Using find in a for loop might suit this:

for i in R1 R2; do
  # Match only .fastq inputs so the output file ($i.txt) is never read back in.
  find . -type f -name "*${i}*.fastq" -exec cat '{}' + >"$i.txt"
done
Zlemini