I'm trying to find a ~500-character DNA sequence in a CSV file, a few megabytes in size, that contains many different sequences. Each sequence is preceded by some metadata that I would also like to retrieve. Each sequence, and each metadata record, takes up exactly one line. I've tried

grep -B 1 "extremelylongstringofDNATACGGCATAGAGGCCGAGACCTAGGATTAACGTTACTGACGAT" csvfile.csv

However, that returns `filename too long`.

Something else odd and frustrating: when I tried to get the line count of the CSV file with

wc -l csvfile.csv

it returned

0 csvfile.csv

And without the -l flag, it returned

0  161410 41507206 csvfile.csv

I get this result even after adding a line break between the end of each sequence and the start of the next sequence's metadata.

riv
  • If `wc -l` can't see multiple lines in the file then the file doesn't have correct newlines/line-endings. What does `file csvfile.csv` say about the file? If you run `head -n 1 csvfile.csv` do you get one line or the entire file as output? – Etan Reisner Jul 21 '15 at 02:04
  • file csvfile.csv returned `CLL037_S2_L001_001_combined.txt: ASCII text, with very long lines, with CR line terminators` – riv Jul 21 '15 at 02:13
  • forgot that it was a text file – riv Jul 21 '15 at 02:14
  • and head returned the entire file – riv Jul 21 '15 at 02:15
  • Did you somehow manage to insert *only* a CR in each line break? Thus creating an old-Mac OS line-ending file? Because that's what it sounds like and is likely the problem. – Etan Reisner Jul 21 '15 at 02:16
  • That seemed to be the issue. I used mac2unix to convert CR line terminators and now Grep works. Thanks! – riv Jul 21 '15 at 02:40
  • Place a special character, e.g. `#`, at the start and end of that string and find the string using `#.+?#` – Muhammad Imran Jul 21 '15 at 06:54

1 Answer

The issue was that the file had CR-only line terminators (the old Mac OS style). The GNU tools only treat LF as a line ending, so they detected no line breaks and read the whole file as one huge line. I solved it by running mac2unix, which converts the CR terminators to LF; after that, grep worked.
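
For anyone without mac2unix installed, here is a minimal sketch of the same diagnosis and fix using only `tr`, `wc` and `grep`. The file names `demo_cr.txt` and `demo_lf.txt` and their contents are made up for illustration, and I'm assuming that a plain CR-to-LF translation is all the file needs, which matches what `file` reported in the comments.

```shell
# Build a tiny stand-in file (hypothetical name and contents) that uses
# CR-only line endings, like the file described in the question.
printf 'meta1\rACGTACGT\rmeta2\rTTGACCA\r' > demo_cr.txt

# There are no LF bytes in the file, so wc -l reports zero lines.
wc -l < demo_cr.txt

# Translate every CR byte to LF and write the result to a new file.
tr '\r' '\n' < demo_cr.txt > demo_lf.txt

# Each record now sits on its own line, so the count is non-zero and
# -B 1 prints the metadata line that precedes the matching sequence.
wc -l < demo_lf.txt
grep -B 1 'TTGACCA' demo_lf.txt
```

Writing to a new file rather than converting in place keeps the original around in case the translation turns out not to be what the file needed.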

Thanks to Etan Reisner for providing the hint.

riv