
I am using the following grep script to output all the unmatched patterns:

grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
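As a toy illustration of what this pipeline computes (file names and values here are made up): the first grep prints every pattern that occurs in the strings, and the second grep removes those hits from the pattern list, leaving the unmatched patterns.

printf '6b6c665d4f44\ndeadbeef0000\n' > p.txt
printf '00006b6c665d4f44ffff\n' > s.txt
grep -oFf p.txt s.txt | grep -vFf - p.txt   # -> deadbeef0000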

The patterns file contains 12-character-long substrings (a few instances are shown below):

6b6c665d4f44
8b715a5d5f5f
26364d605243
717c8a919aa2

The large_strings file contains extremely long strings, each around 20-100 million characters long (a small piece of one string is shown below):

121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525

How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but they don't allow you to pipe two grep commands.
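What I was aiming for was something along these lines, parallelizing only the first grep and piping its combined output into a single second grep (the block size here is just a placeholder):

parallel --pipepart --block 10M -a large_strings.txt grep -oFf patterns.txt | grep -vFf - patterns.txt > unmatched_patterns.txt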

By the way, these are all hexadecimal strings and patterns.

The working code below is a little faster than the traditional grep:

rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt

grep took an hour to finish the pattern matching, while ripgrep took around 45 minutes.

  • 1) try `LC_ALL=C grep ...` 2) use [ripgrep](https://github.com/BurntSushi/ripgrep) – Sundeep Jan 25 '21 at 03:55
  • 3) try one or more `grep -oF -e '6b6c665d4f44' -e '8b715a5d5f5f' large_strings.txt` instead of `grep -oFf patterns.txt large_strings.txt` – Sundeep Jan 25 '21 at 03:59
  • Thanks @Sundeep. I am trying ripgrep now. Still waiting for it to finish. – user3441801 Jan 25 '21 at 04:28
  • @Sundeep doing the patterns individually will take a lot of time as my patterns.txt file has over 30 million patterns :( – user3441801 Jan 25 '21 at 04:30
  • well, if you have millions of search terms AND lines with millions of characters, a custom search engine would be needed to speed it up.. there may be some tool already available as well... I would suggest to ask on https://github.com/BurntSushi/ripgrep/discussions (the tool author and others may have better suggestions) – Sundeep Jan 25 '21 at 04:51
  • @Sundeep `ripgrep` worked for me. It's much faster and accurate than the traditional grep. Thank you so much. I will add the working code in my original post. – user3441801 Jan 25 '21 at 05:13
  • You might try ripgrep built with Hyperscan instead, it could be much faster: https://sr.ht/~pierrenn/ripgrep/ In general, searching for millions of patterns is a very specialized use case, and neither ripgrep nor GNU grep will handle it particularly well. It wouldn't surprise me if most of the runtime was actually being spent compiling the matcher rather than the actual search. But since you haven't provided a way to reproduce your results, it's impossible to say. – BurntSushi5 Jan 25 '21 at 15:41
  • @BurntSushi indeed I will try this out. I found the following script in conjunction with `--pipepart` and `--pipe` significantly faster: `parallel --pipepart --block -1 -a large_strings.txt rg -oFf patterns.txt | rg -Ff - patterns.txt > unmatched_patterns.txt` *credit to @Ole Tange – user3441801 Jan 26 '21 at 04:17

1 Answer


If you do not need to use grep, try:

build_k_mers() {
    k="$1"
    slot="$2"
    perl -ne 'chomp;  # drop the trailing newline so no k-mer contains it
    for $n (0..(length $_)-'"$k"') {
       # bucket each k-mer by its first two characters (00..ff for hex data)
       $prefix = substr($_,$n,2);
       $fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
       $fh = $fh{$prefix};
       print $fh substr($_,$n,'"$k"'),"\n"
    }'
}
export -f build_k_mers
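
A quick smoke test of the function (hypothetical values; run it separately from the main script below, which wipes tmp/): every 4-mer of the input line is appended to a file named after its 2-character prefix.

mkdir -p tmp
printf '6b6c665d\n' | build_k_mers 4 0
cat tmp/kmer.6b.0   # -> 6b6c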

rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm                                                                                                     
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &

# Make shorter lines: after every 32012 characters, insert a newline
# and repeat the 12 characters before it. This makes it easier for
# --pipepart to find a newline, and no 12-mer is lost at a split point.
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
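# Illustration with tiny numbers (split after 5 characters, repeating the last 3):
#   printf 'abcdefghijklmnop' | perl -pe 's/(.{5})(.{3})/$1$2\n$2/g'
# prints:
#   abcdefgh
#   fghijklmnop
#   nop
# Every 3-mer of the original string still occurs within a single line.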
# Build 12-mers                                                                                                                              
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
# Merge and deduplicate the per-slot k-mer files, one 2-char hex prefix (00..ff) at a time
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??
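
The last step works because comm -23 prints the lines unique to its first (sorted) input, i.e. the patterns whose 12-mer was never seen in large_strings.txt. A tiny illustration of that behaviour (file names are made up):

printf 'a\nb\nc\n' > want
printf 'b\nc\nd\n' > have
comm -23 want have   # prints only "a": present in want, absent from have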

I have tested this with patterns.txt (9 GB, 725,937,231 lines) and large_strings.txt (19 GB, 184 lines); on my 64-core machine it completes in 3 hours.

Ole Tange