
I am working on a Linux machine using Bash.

My question is, how can I skip lines in the query file using grep?

I am working with a large (~16 GB) .fastq file named example.fastq, which has the following format.

example.fastq

@SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
@SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
@SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E

I need to extract lines containing strings of interest (@SRR6750041.2, @SRR6750041.5) stored in a Bash array called IDarray, as well as the 3 lines following each match. The following grep command allows me to do this:

for ID in "${IDarray[@]}"; do
    grep -F -A 3 "$ID " example.fastq
done

This correctly outputs the following:

@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E

I am looking for ways to speed this process up. One way would be to reduce the number of lines searched by grep, either by restricting the search to lines beginning with @ or by skipping lines that cannot possibly contain the match @SRR6750041.1, such as lines 2, 3, 4 and 6, 7, 8, etc. Is there a way to do this using grep? Alternative methods are also welcome; one idea is sketched below.
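For example, I imagine anchoring the pattern to the start of the line would look something like this. It is only a sketch: the unescaped dot in the ID would still match any character (it would need a backslash to be literal), and I don't know whether anchoring actually lets grep skip any work.

for ID in "${IDarray[@]}"; do
    grep -A 3 "^${ID} " example.fastq
done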

Paul
  • Does each line have the same number of characters? Even with `awk 'NR % 4 == 0'` I guess it will read the whole file anyway, so the I/O would be the same. – KamilCuk Nov 24 '18 at 20:40
  • this `or skipping lines 2,3,4 and 6,7,8 etc` seems unclear. Can you elaborate? – RomanPerekhrest Nov 24 '18 at 20:40
  • The match `@SRR6750041.1` cannot possibly be contained in lines 2,3,4 or 6,7,8 of the example.fastq. Therefore searching them with grep is wasted effort. Does this make sense? These lines are only of interest once the match has been found. – Paul Nov 24 '18 at 20:42
  • 2
    what is the issue of searching the exact pattern `^@SRR6750041.1` ? (`grep` is fast enough) – RomanPerekhrest Nov 24 '18 at 21:05
  • The dot might need a backslash `^@SRR6750041\.1` if it shouldn't match any character, but I doubt it will speed anything. – choroba Nov 24 '18 at 21:10
  • 1
    @RomanPerekhrest Great question! I thought it would be fast enough also... but with a ~16Gb input file and 10,046 individual `IDarray` lists of strings between 1kb and 100Mb in size it is taking several days to get through a run of this size run in parallel on 56 cores of a linux cluster. Here is a link to the github repository with the full script and a small example dataset https://github.com/paulranum11/splitseq_demultiplexing. I would really value any insight you have on how to speed this process. – Paul Nov 24 '18 at 21:17
  • `splitseqdemultiplex.sh` contains many loops and requires internal profiling to find the bottleneck in that script (investigating which loop(s) cause the biggest performance hit) – RomanPerekhrest Nov 24 '18 at 21:51
  • The grep command that is the focus of this question is “grepfunction2” in the splitseqdemultiplexing.sh script. It can be found under the step 2 heading. This is the bottleneck in the pipeline at this point, especially when large .fastq files are used as input. – Paul Nov 24 '18 at 22:04

1 Answer


Here are some thoughts with examples. For test purposes I created a test case as a mini version of yours: example_mini.fastq is 145 MB and IDarray has 999 elements (IDs of interest).
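One way to generate such a test case (a sketch only; it assumes well-formed 4-line records and `shuf` from GNU coreutils):

$ # take the first 1,000,000 records (4 lines each) as a mini file
$ head -n 4000000 example.fastq > example_mini.fastq
$ # sample 999 read IDs at random into the arr array
$ mapfile -t arr < <(awk 'NR % 4 == 1 { print $1 }' example_mini.fastq | shuf -n 999)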

Your version has this performance (more than 2 mins in user space):

$ time for i in "${arr[@]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
real    3m16.310s
user    2m9.645s
sys     0m53.092s

$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262  out.txt

The first upgrade is to make grep stop after the first match with -m 1, assuming each ID of interest is unique. This cuts the work roughly in half and takes approx 1 min in user space:

$ time for i in "${arr[@]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
real    1m19.325s
user    0m55.844s
sys     0m21.260s

$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262  out.txt

These solutions scale linearly with the number of elements: grep is called n times on a huge file.

Now let's implement it in AWK in a single run. I export IDarray into an input file so everything can be processed in one pass: the big file is loaded into an associative array keyed by ID, then the array of IDs to search is looped through once. This is a generic scenario where you can define the regexp and the number of following lines to print. The complexity is one pass through the file plus N lookups. This is a 2000% speed-up:

$ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
$ time awk '
# 1st file (FNR==NR): still inside the 3-line window after a header -> append to the record
# (checked first, so a quality line starting with @ is not mistaken for a header)
(FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }
# 1st file: a header line -> start a new record keyed by the ID in $1
(FNR==NR) && /^@/ { interest=$1; arr[interest]=$0; linesafter=3; next; }
# 2nd file (IDarray.txt): print the stored record for each requested ID
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real    0m7.044s
user    0m6.628s
sys     0m0.307s

$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262  out.txt

As in your title: if you can really confirm that every fourth line is an ID of interest and the three lines after it are the ones to print, you can simplify to this and speed up by another 20%:

$ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
$ time awk '
# 1st file: lines 1, 5, 9, ... are headers -> start a new record keyed by the ID in $1
(FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }
# 1st file: the other three lines of each record -> append
(FNR==NR) { arr[interest]=arr[interest] RS $0; next; }
# 2nd file (IDarray.txt): print the stored record for each requested ID
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real    0m5.944s
user    0m5.593s
sys     0m0.242s

$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262  out.txt

On a 1.5 GB file with 999 elements to search, the time is:

real    1m4.333s
user    0m59.491s
sys     0m3.460s

So per my prediction, on my machine your 15 GB example with 10k elements would take approx 16 minutes in user space to process.
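For comparison, grep itself can also do a single pass by reading all patterns from a file. This is a sketch only, not one of the measured runs above: --no-group-separator is GNU grep specific (it suppresses the -- lines that -A would otherwise insert between matches), and the output comes in file order rather than array order, so its md5sum will differ.

$ # one fixed-string pattern per line; the trailing space keeps @SRR6750041.2 from also matching @SRR6750041.20
$ for i in "${arr[@]}"; do printf '%s \n' "$i"; done > patterns.txt
$ grep -F -A 3 --no-group-separator -f patterns.txt example_mini.fastq 1> out.txt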

Kubator