5

I have a byte stream that returns a sequence of byte arrays, each of which represents a single record.

I would like to parse the stream into a list of individual byte[]s. Currently, I have hacked in a three-byte delimiter so that I can identify the end of each record, but I have concerns.

I see that there is a standard ASCII record separator character.

30 (dec)   036 (oct)   1E (hex)   00011110 (bin)   RS   Record Separator

Is it safe to use a byte[] derived from this character as a delimiter if the byte arrays (which were UTF-8 encoded) have been compressed and/or encrypted? My concern is that the encryption/compression output might produce the record separator byte for some other purpose. Please note the individual byte[] records are compressed/encrypted, rather than the entire stream.

I am working in Java 8 and using Snappy for compression. I haven't picked an encryption library yet, but it would certainly be one of the stronger, standard, private key approaches.

Kevin
L. Blanc
  • Private key approaches aren't stronger. They are just used for different applications, mainly key agreement or key transport. You should encode the length of each message in the stream rather than trying to choose a delimiter. It's safer and easier. – erickson Aug 14 '15 at 17:26
  • @erickson I think you missed the comma between "stronger" and "private key". I was saying that of the standard private key algorithms, I would choose one of the stronger ones. Also, you are confusing private and public key crypto. Public key crypto is used for agreement and to transport private (symmetric) keys. – L. Blanc Aug 14 '15 at 18:29
  • Okay, wanted to make sure that you weren't laboring under the common misconception that asymmetric algorithms are somehow more secure than symmetric algorithms. "Symmetric" or "secret key" is much less likely to be misunderstood than "private key." – erickson Aug 14 '15 at 19:28

2 Answers

8

You can't simply declare a byte as a delimiter if you're working with random, unstructured data (which compressed/encrypted data resembles quite closely), because the delimiter byte can always appear as a regular data byte in such data.

If the size of the data is already known when you start writing, simply write the size first and then the data. When reading back, you know you need to read the size first (e.g. 4 bytes for an int), and then as many bytes as the size indicates.
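The "size first, then data" scheme can be sketched with DataOutputStream/DataInputStream; the class and method names below are illustrative, not from the answer:

```java
import java.io.*;
import java.util.*;

public class LengthPrefixFraming {

    // Write each record as a 4-byte big-endian length followed by the payload.
    static byte[] frame(List<byte[]> records) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (byte[] record : records) {
            out.writeInt(record.length); // size first...
            out.write(record);           // ...then the data
        }
        return bos.toByteArray();
    }

    // Read records back: length first, then exactly that many bytes.
    static List<byte[]> unframe(byte[] stream) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
        List<byte[]> records = new ArrayList<>();
        // available() is only reliable for in-memory streams like this one;
        // for sockets/files, loop until readInt() throws EOFException instead.
        while (in.available() > 0) {
            byte[] record = new byte[in.readInt()];
            in.readFully(record);
            records.add(record);
        }
        return records;
    }
}
```

Because the payload is never inspected, this works unchanged for compressed or encrypted records.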

This will obviously not work if you can't tell the size while writing. In that case, you can use an escaping mechanism: select a rarely appearing byte as the escape character, escape all occurrences of that byte in the data, and use a different byte as the end indicator.

e.g.

final static byte ESCAPE = (byte) 0xBC;
final static byte EOF = (byte) 0x00;

byte[] source = ...  // one record's bytes
OutputStream out = ...
for (byte b : source) {
    if (b == ESCAPE) {
        // escape data bytes that have the value of ESCAPE
        out.write(ESCAPE);
        out.write(ESCAPE);
    } else {
        out.write(b);
    }
}
// write the end-of-record marker: ESCAPE, EOF
out.write(ESCAPE);
out.write(EOF);

Now when reading, if you encounter the ESCAPE byte, you read the next byte and check for EOF. If it's not EOF, it's an escaped ESCAPE that represents a data byte.

InputStream in = ...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    // in.read() returns an int in 0..255, so cast before
    // comparing with the signed byte constants
    if ((byte) b == ESCAPE) {
        b = in.read();
        if ((byte) b == EOF)
            break;
        buffer.write(b);
    } else {
        buffer.write(b);
    }
}

If the bytes to be written are perfectly randomly distributed, this will increase the stream length by 1/256 on average. For data domains that are not completely random, you can select the byte that appears least frequently (determined by analyzing the data, or by an educated guess).

Edit: you can reduce the escaping overhead by using more elaborate logic. For example, the code above only ever produces ESCAPE + ESCAPE or ESCAPE + EOF; the other 254 byte values can never follow an ESCAPE, so those combinations could be exploited to encode legal data.

Durandal
  • Rather than escaping, most protocols that support messages where the length is not known in advance use a chunking approach, where each chunk is length-encoded, and then the final chunk has a length of zero to signal the end of the message. – erickson Aug 14 '15 at 17:28
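The chunking approach from the comment can be sketched like this (names are illustrative): each chunk carries its own length prefix, and a zero-length chunk terminates the message, so no byte value ever needs escaping:

```java
import java.io.*;

public class ChunkedFraming {

    // Write data in chunks: each chunk is a 4-byte length + payload;
    // a zero-length chunk marks the end of the message.
    static void writeChunked(DataOutputStream out, InputStream data, int chunkSize)
            throws IOException {
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = data.read(buf)) != -1) {
            if (n > 0) {               // never emit an accidental terminator
                out.writeInt(n);
                out.write(buf, 0, n);
            }
        }
        out.writeInt(0);               // terminator chunk
    }

    // Read chunks until the zero-length terminator, reassembling the message.
    static byte[] readChunked(DataInputStream in) throws IOException {
        ByteArrayOutputStream msg = new ByteArrayOutputStream();
        int len;
        while ((len = in.readInt()) != 0) {
            byte[] chunk = new byte[len];
            in.readFully(chunk);
            msg.write(chunk);
        }
        return msg.toByteArray();
    }
}
```

This is essentially HTTP/1.1 chunked transfer encoding, and it lets the writer start streaming before the total size is known.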
2

It is completely unsafe; you never know what might turn up in your data. Perhaps you should consider something like protobuf, or a scheme like "first write the record length, then write the record, then rinse, lather, repeat"?

If you have a length, you don't need a delimiter. Your reading side reads the length, then knows how much to read for the first record, and then knows to read the next length -- all assuming that the lengths themselves are fixed-length.

See the developers' suggestions for streaming a sequence of protobufs.
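For reference, protobuf's Java API already supports this via MessageLite.writeDelimitedTo(OutputStream) and the matching parseDelimitedFrom(InputStream): each message is prefixed with its length encoded as a base-128 varint. A minimal standalone sketch of that varint prefix (helper names are mine, not protobuf's):

```java
import java.io.*;

public class VarintFraming {

    // Encode a non-negative int as a base-128 varint:
    // 7 data bits per byte, high bit set on all but the last byte.
    static void writeVarint(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value);                     // final byte, high bit clear
    }

    // Decode a varint written by writeVarint.
    static int readVarint(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        while (((b = in.read()) & 0x80) != 0) {
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result | (b << shift);
    }
}
```

Small records then cost only one prefix byte instead of four, which matters when the stream contains many short messages.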

bmargulies
  • As it happens, they are protobufs (ignoring compression/encryption), but stream of protobufs, not a single protobuf. Is there a standard way that a stream of individual protobufs would be delimited? Also, given the record length scheme suggested, wouldn't I still need to identify the start of each record? Not sure how that helps. – L. Blanc Aug 14 '15 at 16:43