Problem
I am describing a very simplified version of my problem here. I have a huge file (10-50GB) which I need to split into millions of chunks. Suppose I have certain lines containing a string "SPLITTER". I need to split the file by those lines. Each chunk will contain the text between two SPLITTER lines.
This is of course very simplified, and the actual use-case will involve a bit more complicated matching/splitting.
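To make the simplified version concrete, here is a minimal sketch of the line-based splitting I have in mind. The class name `LineSplitter`, the `split` method and the `Consumer` sink are placeholders I made up for illustration. It also assumes that text before the first SPLITTER line and after the last one counts as a chunk, and that empty chunks are dropped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper: streams lines and hands each chunk to a sink.
final class LineSplitter {
    // Assumed marker; the real matching rule would be more complicated.
    static final String SPLITTER = "SPLITTER";

    static void split(Reader in, Consumer<String> sink) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder chunk = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(SPLITTER)) {
                // A splitter line closes the current chunk and is not part of it.
                emit(chunk, sink);
            } else {
                // readLine() strips the terminator, so a '\n' is re-added here.
                chunk.append(line).append('\n');
            }
        }
        // Whatever follows the last splitter line is emitted as a final chunk.
        emit(chunk, sink);
    }

    private static void emit(StringBuilder chunk, Consumer<String> sink) {
        if (chunk.length() > 0) {
            sink.accept(chunk.toString());
            chunk.setLength(0);
        }
    }
}
```

Only one chunk is held in memory at a time, so the total file size should not matter as long as individual chunks are small.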
Question
So this is a streaming matching problem. Which of these is more efficient?
1. Treat it as a string matching problem: read lines with a buffered reader, compare each line against the splitter, and split accordingly.
2. Treat the file as a binary input stream, represent the splitter strings as byte arrays (byte[]), and do byte-wise comparisons.
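For the byte-wise option, this is roughly what I picture: a minimal, untuned sketch that scans the stream with a Knuth-Morris-Pratt failure table so that partial matches never require rewinding the input. The names `ByteSplitter`, `split` and `failureTable` are hypothetical. Unlike a line-based version, it splits at the delimiter bytes themselves rather than at whole lines containing them, and it drops empty chunks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical helper: splits a byte stream on a delimiter without decoding it.
final class ByteSplitter {
    // KMP failure table: fail[i] is the length of the longest proper prefix
    // of p[0..i] that is also a suffix of p[0..i].
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) k = fail[k - 1];
            if (p[i] == p[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    static void split(InputStream in, byte[] delimiter, Consumer<byte[]> sink)
            throws IOException {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int[] fail = failureTable(delimiter);
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        int matched = 0; // number of delimiter bytes matched so far
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                byte b = buf[i];
                chunk.write(b);
                // On mismatch, fall back to the longest prefix that still matches.
                while (matched > 0 && b != delimiter[matched]) matched = fail[matched - 1];
                if (b == delimiter[matched]) matched++;
                if (matched == delimiter.length) {
                    // The chunk buffer ends with the delimiter; cut it off.
                    byte[] all = chunk.toByteArray();
                    int len = all.length - delimiter.length;
                    if (len > 0) sink.accept(Arrays.copyOf(all, len));
                    chunk.reset();
                    matched = 0;
                }
            }
        }
        // Bytes after the last delimiter form the final chunk.
        if (chunk.size() > 0) sink.accept(chunk.toByteArray());
    }
}
```

The match state is carried across `read` calls, so a delimiter that straddles two buffer fills is still found.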
I would like some insight before I start implementing anything.
I am using Java. Also, the original input will be decompressed from bz2 on the fly, for what it's worth.