I have the following directories:

P922_101
P922_102
.
.

Each directory, for instance P922_101, has the following subdirectories:

140311_AH8MHGADXX  140401_AH8CU4ADXX

Each subdirectory, for instance 140311_AH8MHGADXX has the following files:

1_140311_AH8MH_P922_101_1.fastq.gz  1_140311_AH8MH_P922_101_2.fastq.gz    
2_140311_AH8MH_P922_101_1.fastq.gz  2_140311_AH8MH_P922_101_2.fastq.gz

And files in 140401_AH8CU4ADXX are:

1_140401_AH8CU_P922_101_1.fastq.gz  1_140401_AH8CU_P922_4001_2.fastq.gz        
2_140401_AH8CU_P922_101_1.fastq.gz  2_140401_AH8CU_P922_4001_2.fastq.gz

I want to concatenate (cat) the files in the subdirectories in the following way:

cat 1_140311_AH8MH_P922_101_1.fastq.gz 2_140311_AH8MH_P922_101_1.fastq.gz \
    1_140401_AH8CU_P922_101_1.fastq.gz 2_140401_AH8CU_P922_101_1.fastq.gz > P922_101_1.fastq.gz

That is, the files ending with _1.fastq.gz should be concatenated into one file and the files ending with _2.fastq.gz into another.

This should be run for all files in the subdirectories of every directory. Could someone suggest a Linux solution for this?

chas

2 Answers


You can use find for this:

find /top/path -mindepth 2 -type f -name "*_1.fastq.gz" -exec cat {} \; > one_file
find /top/path -mindepth 2 -type f -name "*_2.fastq.gz" -exec cat {} \; > another_file

This searches everything under /top/path for files whose names match *_1.fastq.gz (or *_2.fastq.gz) and concatenates them into the given output file. -mindepth 2 restricts the matches to files at least two levels below the starting point, so files sitting directly in /top/path are not matched. Note that find does not return files in any guaranteed order.
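To see what -mindepth 2 keeps and skips, here is a small sketch on a throwaway tree. The file names top_1, mid_1 and deep_1 are made up for the demo; only the directory names come from the question.

```shell
# Build a temporary tree with one matching file at each depth.
demo=$(mktemp -d)
mkdir -p "$demo/P922_101/140311_AH8MHGADXX"
touch "$demo/top_1.fastq.gz"                              # depth 1: skipped
touch "$demo/P922_101/mid_1.fastq.gz"                     # depth 2: matched
touch "$demo/P922_101/140311_AH8MHGADXX/deep_1.fastq.gz"  # depth 3: matched

# The starting point itself counts as depth 0.
found=$(find "$demo" -mindepth 2 -type f -name "*_1.fastq.gz" | sort)
printf '%s\n' "$found"
```

Only the mid_1 and deep_1 paths are printed; the file directly under the starting point is left out.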

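Note that plain cat keeps the result gzip-compressed: concatenated gzip files form a valid gzip file, which fits an output name such as P922_101_1.fastq.gz. Use zcat instead of cat only if you want the output uncompressed.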
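Whether cat or zcat is the right tool depends on whether the output should stay compressed. The gzip format allows several compressed members back to back, which the following sketch checks on two tiny files (a.gz and b.gz are hypothetical names used only here):

```shell
work=$(mktemp -d)
printf 'first\n'  | gzip -c > "$work/a.gz"
printf 'second\n' | gzip -c > "$work/b.gz"

# Plain cat: the result is still compressed (a multi-member gzip file).
cat "$work/a.gz" "$work/b.gz" > "$work/combined.gz"
gzip -t "$work/combined.gz" && echo "combined.gz passes the gzip integrity test"

# Decompressing it yields the contents of both inputs, in order.
combined=$(gzip -dc "$work/combined.gz")
printf '%s\n' "$combined"
```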


As you keep adding details in comments, let's see what else we can do:

Say you have the list of directories in a file directories_list, each line containing one:

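# Read one directory per line; -r keeps backslashes literal.
while IFS= read -r directory
do
   # Quote "$directory" so paths with spaces survive; the output lands in that directory.
   find "$directory" -mindepth 2 -type f -name "*_1.fastq.gz" -exec cat {} \; > "$directory/output"
done < directories_list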
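The loop above can be extended to cover both read files and to report which files go into each output, as asked in the comments. This is only a sketch under a few assumptions: the function name merge_reads is made up, the output name <directory>/<directory name>_<read>.fastq.gz is taken from the question's example, and paths are assumed not to contain newlines.

```shell
# merge_reads LIST: LIST is a file with one sample directory per line.
merge_reads() {
    while IFS= read -r directory; do
        [ -d "$directory" ] || continue
        for read_no in 1 2; do
            out="$directory/$(basename "$directory")_${read_no}.fastq.gz"
            # -mindepth 2 skips files directly in $directory (including $out).
            # sort fixes the order, since find itself guarantees none.
            find "$directory" -mindepth 2 -type f -name "*_${read_no}.fastq.gz" |
                sort |
                while IFS= read -r f; do
                    echo "adding $f to $out" >&2   # report each input file
                    cat "$f"
                done > "$out"
        done
    done < "$1"
}

# Usage (directories_list as described above):
#   merge_reads directories_list
```

Sorting the paths means the _1 and _2 outputs are assembled in the same subdirectory and lane order, which matters if the two files are later processed as pairs.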
fedorqui
  • Thanks! Is it possible to print which files are concatenated, just to make sure everything is working correctly? – chas Sep 12 '14 at 14:42
  • @user1779730 doing `find /top/path -type f -name "*_1.fastq.gz"` alone will give you the names of the files. – fedorqui Sep 12 '14 at 14:43
  • In the above example, only the files in the subdirectories of the root directory should be concatenated, which means concatenation is independent of the root directory. – chas Sep 12 '14 at 14:44
  • @user1779730 I see. Using `-mindepth 2` should do it. See my update. – fedorqui Sep 12 '14 at 14:47
  • Even with -mindepth it seems to do the same as before. In the initial post, P922_101 and P922_102 are the main directories. Each of these main directories has subdirectories with files. The files in the subdirectories of P922_101 should be concatenated into the main directory P922_101, and the files in the subdirectories of P922_102 should be concatenated into the main directory P922_102. – chas Sep 12 '14 at 14:54
  • @user1779730 then use full paths on everything: `find /path/of/dir1 -mindepth 2 .... -exec cat {} \; > /path/of/dir1/output1` and the same with path2. – fedorqui Sep 12 '14 at 14:57
  • Thank you. But I have hundreds of such directories, and it would be cumbersome to specify the path for each one. Could you suggest a better way? – chas Sep 12 '14 at 15:12
  • @user1779730 see update. Try to do some research and give all the details beforehand. It is quite tiring to get more and more conditions after the question was formulated. – fedorqui Sep 12 '14 at 15:19
  • @user1779730 so what did you end up doing? Remember you can accept an answer if you are done. – fedorqui Sep 15 '14 at 10:15

Since the files are compressed, you should probably use gzip -dc (decompress and write to stdout) if you want the concatenated output as plain text:

find /somePath -type f -name "*.fastq.gz" -exec gzip -dc {} \; | \
    tee -a /someOutFolder/out.txt
Elliott Frisch