0

I am cleaning up sequencing data. I have a file with reads (names) I'd like to find with grep and eventually remove. The patternfile file is 219,721 lines long, with no duplicate entries. The sequence .fastq file is 557,514,608 lines long with no names that are duplicated.

I used: grep -f patternfile.txt sequencefile.fastq > outputfile.txt

I am expecting the output file to be identical to the patternfile (except with 1:N:0:ACTGAT included at the end) but instead the outputfile has 135 complete extra lines (names). These extra names are not duplicates, and are not found in the patternfile. I can open the output file and identify the extra lines. An example is shown below for patternfile lines 340-342:

@NB501827:133:HMV5HAFX2:1:11101:13856:12920
@NB501827:133:HMV5HAFX2:1:11101:16016:12934
@NB501827:133:HMV5HAFX2:1:11101:19446:12943

The output file is identical up to line 341 as shown below:

@NB501827:133:HMV5HAFX2:1:11101:13856:12920 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:26336:12921 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:16016:12934 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:19446:12943 1:N:0:ACTGAT

Note that the erroneous extra line at line 341 @NB501827:133:HMV5HAFX2:1:11101:26336:12921 1:N:0:ACTGAT is present, and is just one example of the other 134 extra lines.

Why am I doing this? This is a paired end read sequencing experiment and I found 219,721 instances where the reads in the "sequencefileR2" file were a string of 75 "G" and obvious errors due to the sequencing. I was able to use grep to pull out these sequence names, and now want to remove the corresponding reads in both files (sequencefileR1 and sequencefileR2). The plan was to use the inverse flag of grep (e.g. grep -v) to produce sequence files without these specific sequences. I checked the grep output before producing the final files and found this issue.

What have I tried? I have tried making sure there are no Windows (DOS) end-of-lines present. I have tried including the 1:N:0:ACTGAT in the patternfile I have tried this command on three different filesystems (CentOS7, Gitbash, Cygwin) with identical results (always get exactly the same output). I have tried egrep I have used the patternfile individual lines 340, 341, and 342 (and also the erroneous output line) shown above, and only get one output line from the sequence file (e.g.)

grep @NB501827:133:HMV5HAFX2:1:11101:13856:12920 sequencefileR2.fastq
@NB501827:133:HMV5HAFX2:1:11101:13856:12920 1:N:0:ACTGAT

I have tried removing the @ symbol from every line of the patternfile but got identical results. I have tried putting the grep in a loop (which didn't work, they were amateur attempts)

for pattern in 'R1-R2-names.txt'; do     grep "$pattern"
 L_lactis_S1_LALL_R2_001.fastq >> loopr1names; done

for pattern in 'cat R1-R2-names.txt'; do     grep "$pattern"
 L_lactis_S1_LALL_R2_001.fastq >> loopr1names; done

I'm open to sed and awk solutions but would like to understand why this simple bash solution is not working. Thanks.

hoytpr
  • 5
  • 3
  • 1
    Can you reduce this to a smaller example that we can try to reproduce? – Barmar Mar 15 '21 at 18:23
  • If you do `grep @NB501827:133:HMV5HAFX2:1:11101:26336:12921 patternfile` it doesn't show anything? – Barmar Mar 15 '21 at 18:26
  • Note that `grep` doesn't care about order, it will find matches even if they're not in the same order in the two files. Use `comm` to compare files in order. – Barmar Mar 15 '21 at 18:27
  • `grep @NB501827:133:HMV5HAFX2:1:11101:26336:12921 patternfile` gives nothing – hoytpr Mar 15 '21 at 18:30
  • Your `for` loops won't work. You're not getting the contents of `R1-R2-names.txt`, you're using that as the literal string to set `pattern` to. – Barmar Mar 15 '21 at 18:31
  • ```for pattern in $(cat R1-R2-names.txt)``` – Barmar Mar 15 '21 at 18:32
  • It's not order right? or it would show up in the above line – hoytpr Mar 15 '21 at 18:32
  • Right. I was just trying to confirm your expectations. When you said the output was identical to the pattern file until that line, it seemed like you expected them to be in the same order. – Barmar Mar 15 '21 at 18:33
  • Is `@NB501827:133:HMV5HAFX2:1:11101:13856:1292` in the pattern file? Or some shorter variant of that? – Barmar Mar 15 '21 at 18:35
  • 1
    If so, use `grep -w` to make sure the match is a whole word. – Barmar Mar 15 '21 at 18:36
  • we really need a sample set of input data files (smaller the better) that demonstrate the issue so that **1)** we can reproduce the issue and **2)** possibly explain what you're seeing; while you've stated the line ending in `...12921` should not be in the output, keep in mind this line could have been found based on a match with `...12`, `...129`, `...1292` and `...12921`; are you 100% sure the pattern file has been sorted and that a pattern ending in `...12921` doesn't exist somewhere out-of-order in the patternfile? – markp-fuso Mar 15 '21 at 18:38
  • There are shorter variants like "@NB501827:133:HMV5HAFX2:1:11101:138" – hoytpr Mar 15 '21 at 18:40
  • Yes `12921` exists in the patternfile twice at the end of lines. Is that enough to cause this? I will try using -w which I should have tried already. – hoytpr Mar 15 '21 at 18:46

1 Answers1

0

Use

grep -w -F -f patternfile.txt sequencefile.fastq > outputfile.txt

-w means to match the pattern only when it's surrounded by word boundaries. -F means to match fixed-text patterns, not regular expressions (this is probably not significant here, as your patterns don't seem to contain any characters that have special meaning, but it's good practice).

I suspect your pattern file contains a prefix of @NB501827:133:HMV5HAFX2:1:11101:26336:12921, so it's matching this line. The -w option will will prevent matching these prefixes.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Yes, the `-w` flag fixed it, although `@NB501827:133:HMV5HAFX2:1:11101:26336:12921` was not found in the patternsfile. ```grep -w -f R1-R2-names.txt sequencefile.fastq > outputfile.txt``` fixed it. – hoytpr Mar 15 '21 at 18:53
  • 1
    Because the pattern file contains a prefix like `:1292` or `:129` or `:12` or `:1` – Barmar Mar 15 '21 at 18:56