Is there a good way to find exact matches of an extremely long string (~500 characters) in a CSV file a couple of megabytes in size?

Question

I'm trying to find a match for a ~500-character DNA sequence in a CSV file of a few megabytes that contains many different sequences. Each sequence in the CSV file is preceded by some metadata that I would also like to retrieve. Each sequence and its metadata take up exactly one line. I've tried

grep -B 1 "extremelylongstringofDNATACGGCATAGAGGCCGAGACCTAGGATTAACGTTACTGACGAT" csvfile.csv

However, that returns "filename too long".

An interesting and frustrating thing I ran into: when I tried to get the line count of the CSV file using

wc -l csvfile.csv

it returned

0 csvfile.csv

And without the -l flag, it returned

0  161410 41507206 csvfile.csv

This is the result even after I added a line break between the end of each sequence and the start of the metadata for the next sequence.

If `wc -l` can't see multiple lines in the file then the file doesn't have correct newlines/line-endings. What does `file csvfile.csv` say about the file? If you run `head -n 1 csvfile.csv` do you get one line or the entire file as output? — Etan Reisner, Jul 21 '15 at 02:04
file csvfile.csv returned `CLL037_S2_L001_001_combined.txt: ASCII text, with very long lines, with CR line terminators` — riv, Jul 21 '15 at 02:13
Did you somehow manage to insert *only* a CR in each line break? Thus creating an old-Mac OS line-ending file? Because that's what it sounds like and is likely the problem. — Etan Reisner, Jul 21 '15 at 02:16
That seemed to be the issue. I used mac2unix to convert the CR line terminators, and now grep works. Thanks! — riv, Jul 21 '15 at 02:40
Place a special character, e.g. #, at the start and end of that string, then find the string using `#.+?#` — Muhammad Imran, Jul 21 '15 at 06:54

score 1 · Accepted Answer · answered Jul 21 '15 at 14:59

The issue was that the file had CR line terminators. GNU tools do not treat a bare CR as a line ending, so they were reading the whole file as one huge line. I solved the issue by using mac2unix to convert the file to LF line endings, which GNU tools can read.

Thanks to Etan Reisner for providing the hint
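
For anyone without mac2unix installed, here is a minimal sketch of the same diagnosis and fix using only standard tools. The sample file, its contents, and the variable names are made up for illustration; `tr '\r' '\n'` is a stand-in for mac2unix here, not what I actually ran. Adding `-F` to grep makes it treat the sequence as a fixed string rather than a regular expression, which seems like a reasonable choice for exact matches.

```shell
#!/bin/sh
# Hypothetical sample data: two records separated by bare CR (old Mac OS line endings).
tmpdir=$(mktemp -d)
printf 'meta1,ACGTACGT\rmeta2,TTGGCCAA\r' > "$tmpdir/cr.csv"

# wc -l counts LF characters only, so a CR-only file reports 0 lines.
before=$(wc -l < "$tmpdir/cr.csv" | tr -d ' ')

# Translate every CR into LF (a stand-in for mac2unix).
tr '\r' '\n' < "$tmpdir/cr.csv" > "$tmpdir/lf.csv"
after=$(wc -l < "$tmpdir/lf.csv" | tr -d ' ')

# -F matches the pattern as a fixed string; the whole matching line,
# metadata included, is returned.
match=$(grep -F 'TTGGCCAA' "$tmpdir/lf.csv")

echo "lines before: $before, after: $after"
echo "match: $match"
rm -rf "$tmpdir"
```

Since each sequence shares a line with its metadata, the matching line already contains both once the line endings are fixed.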
