Problem

I am describing a very simplified version of my problem here. I have a huge file (10-50GB) which I need to split into millions of chunks. Suppose I have certain lines containing a string "SPLITTER". I need to split the file by those lines. Each chunk will contain the text between two SPLITTER lines.

This is of course very simplified, and the actual use-case will involve a bit more complicated matching/splitting.

Question

So we have a streaming matching problem here. Which is more efficient: (1) treating this as a string matching problem, using a buffered reader to read lines and comparing each line against the splitter, or (2) treating the file as a binary input stream, representing the splitter strings as byte arrays (`byte[]`), and doing byte-wise comparisons?
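To make the first option concrete, here is a minimal sketch of the line-based approach. The class name `LineSplitSketch` and the helper `splitByLines` are hypothetical, and the sketch collects chunks in memory purely for illustration; with a 10-50GB input each chunk would be written out as soon as it is complete.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class LineSplitSketch {
    // Hypothetical helper: returns the text found between splitter lines.
    // Chunks are kept in a list only to keep the example small.
    static List<String> splitByLines(Reader in, String splitter) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            // readLine() decodes bytes to chars and allocates a String per line
            while ((line = reader.readLine()) != null) {
                if (line.contains(splitter)) {
                    // A splitter line closes the current chunk (empty chunks are skipped)
                    if (current.length() > 0) {
                        chunks.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(line).append('\n');
                }
            }
        }
        // Text after the last splitter line forms the final chunk
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

For a real file the `Reader` would be something like `new InputStreamReader(someBinaryInputStream, charset)`, where the charset is whatever encoding the decompressed text uses.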

I need to get an insight before I start implementing something.

I am using Java. Also, the original input will be decompressed from bz2 on the fly, for what it's worth.

Nilesh
  • *Can* you even treat the file as a bunch of strings? Arbitrary binary data doesn't generally make valid `String`s, though I don't know whether Java enforces this. –  May 23 '14 at 23:47
  • @delnan That's my point. Treating the file as a bunch of strings technically means something like new BufferedReader(new InputStreamReader(someBinaryInputStream)) - and performing my matching splitting stuff using this Reader. Doesn't this incur additional overhead? Since the file is tens of GBs and my matching strings are only maybe dozens of characters, converting those to bytes and directly byte-matching should be faster? – Nilesh May 23 '14 at 23:54
  • But that's not *my* point. When you have a bunch of `byte`s, interpreting them as `String` (or `char[]`) is not correct in general. Depending on how you do it, you might mangle the binary data, miss a SPLITTER line, generate invalid strings and cause who-knows-what errors in string processing, or some other nonsense. Whatever is in between your "SPLITTER lines", if it's really binary data you can't make a string out of it anyway, so *the question is moot*. –  May 24 '14 at 00:00
  • I see what you mean now, and I agree. But I should have mentioned that my particular use case actually deals with string data. I'm just decompressing the bz2 file in Java and getting it as a binary stream. – Nilesh May 24 '14 at 00:08
  • In the stream which you split based on `SPLITTER`, are those strings still bz2 binary data, or already decompressed? In the former case (splitting bz2 streams), nothing changes: The data in between is binary *now*, even if it is *converted* into proper strings later. –  May 24 '14 at 00:11
  • Directly splitting bz2 streams would be too risky - there are multistream bz2's and what-not - so both `SPLITTER` and the data are actually strings. – Nilesh May 24 '14 at 00:24
  • Okay, nevermind then. –  May 24 '14 at 00:36

1 Answer


It's always going to be quicker to treat the data as raw bytes. Getting the data as strings means reading it as bytes anyway, then decoding those bytes into characters and creating a new `String` for each line.
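As a rough illustration of the byte-oriented approach, the sketch below scans the stream one line at a time as raw bytes and checks each line for the splitter with a plain byte comparison, never decoding to `String`. The class `ByteSplitSketch` and its methods are hypothetical names, the naive search stands in for whatever matching the real use case needs, and it assumes an encoding such as ASCII or UTF-8 in which the splitter's bytes cannot appear as part of other characters. Chunks are collected in memory only to keep the example short.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ByteSplitSketch {
    // Naive byte-wise search: true if pattern occurs anywhere in buf
    static boolean contains(byte[] buf, byte[] pattern) {
        outer:
        for (int i = 0; i + pattern.length <= buf.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (buf[i + j] != pattern[j]) continue outer;
            }
            return true;
        }
        return false;
    }

    // Hypothetical helper: returns the raw bytes found between splitter lines
    static List<byte[]> splitByBytes(InputStream raw, byte[] splitter) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        try (InputStream in = new BufferedInputStream(raw, 1 << 16)) {
            int b;
            while ((b = in.read()) != -1) {
                line.write(b);
                if (b == '\n') {
                    flushLine(line, chunk, chunks, splitter);
                }
            }
        }
        flushLine(line, chunk, chunks, splitter); // last line may lack a newline
        if (chunk.size() > 0) {
            chunks.add(chunk.toByteArray());
        }
        return chunks;
    }

    // Either closes the current chunk (splitter line) or appends the line to it
    private static void flushLine(ByteArrayOutputStream line, ByteArrayOutputStream chunk,
                                  List<byte[]> chunks, byte[] splitter) {
        if (line.size() == 0) return;
        byte[] bytes = line.toByteArray();
        if (contains(bytes, splitter)) {
            if (chunk.size() > 0) {
                chunks.add(chunk.toByteArray());
            }
            chunk.reset();
        } else {
            chunk.write(bytes, 0, bytes.length);
        }
        line.reset();
    }
}
```

In practice `raw` would be the decompressing bz2 stream, and each finished chunk would be written to its destination instead of being added to a list. Whether the saving over the `BufferedReader` route matters next to the cost of bz2 decompression is worth measuring on real data.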

SimonC