43

I'm using Google Protocol Buffers to serialize equity market data (i.e. timestamp, bid, and ask fields). I can store one message in a file and deserialize it without issue.

How can I store multiple messages in a single file? I'm not sure how to separate the messages. I need to be able to append new messages to the file on the fly.

gsamaras
DD.
  • 1
    @Anony-Mousse: reading would not work without delimiters if you write more than 1 top level message in a file/stream. See accepted answer from Marc Gravell and https://developers.google.com/protocol-buffers/docs/techniques#streaming – Guillaume Perrot Jan 15 '13 at 15:04

5 Answers

36

I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
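
To illustrate what the length prefix looks like on the wire, here is a minimal C++ sketch of the same framing idea using only the standard library. It is not the protobuf API: the helper names (`write_varint32`, `read_varint32`, `write_delimited`, `read_delimited`, `frame_messages`) are hypothetical, and each payload is a plain `std::string` standing in for an already serialized message. It assumes the length is written as a base-128 varint, which is how the Java documentation describes `writeDelimitedTo`.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Base-128 varint: 7 bits per byte, low group first, high bit set on
// every byte except the last one.
void write_varint32(std::ostream& out, std::uint32_t value) {
    while (value >= 0x80) {
        out.put(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.put(static_cast<char>(value));
}

// Returns false at end of stream or if the prefix is longer than 5 bytes.
bool read_varint32(std::istream& in, std::uint32_t& value) {
    value = 0;
    for (int shift = 0; shift < 35; shift += 7) {
        const int byte = in.get();
        if (!in) return false;  // end of stream or read error
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Writes the payload size as a varint, then the payload bytes.
void write_delimited(std::ostream& out, const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // size must fit in 32 bits
    write_varint32(out, static_cast<std::uint32_t>(payload.size()));
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Reads exactly one payload and no farther; false when nothing complete is left.
bool read_delimited(std::istream& in, std::string& payload) {
    std::uint32_t size = 0;
    if (!read_varint32(in, size)) return false;
    payload.assign(size, '\0');
    if (size == 0) return true;
    in.read(&payload[0], static_cast<std::streamsize>(size));
    return in.gcount() == static_cast<std::streamsize>(size);
}

// Convenience: the bytes that would end up in the file for these payloads.
std::string frame_messages(const std::vector<std::string>& payloads) {
    std::ostringstream out;
    for (std::size_t i = 0; i < payloads.size(); ++i) {
        write_delimited(out, payloads[i]);
    }
    return out.str();
}
```

In real code the payload would come from the message's own serialization call, and the reader would hand each payload to the parser. Calling `read_delimited` in a loop until it returns false walks the whole stream.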

Josh Hansen
  • I'm not sure---I'm pretty new to the protobuf world myself. – Josh Hansen Feb 21 '13 at 18:34
  • 1
    Do they have an equivalent in C++? – Andrew Hundt Apr 04 '14 at 17:34
  • Looks like this has always been there (in the Java implementation at least). It's in this rev from 2009: https://code.google.com/p/protobuf/source/browse/trunk/java/src/main/java/com/google/protobuf/MessageLite.java?spec=svn163&r=163 – simonp Jun 06 '14 at 08:05
  • 7
    @AndrewHundt The C++ equivalent is provided by the author of the Protobuf C++ library in this [stackoverflow answer](http://stackoverflow.com/a/22927149/757777) – Erik Sjölund Oct 16 '15 at 11:53
14

From the docs:

http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
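
The recipe quoted above (write the size, then the message; on reading, read the size, read that many bytes into a separate buffer, parse from the buffer) could be sketched in C++ roughly like this, including appending to an existing file. The choices here are assumptions of this sketch, not something the docs prescribe: a fixed 4-byte little-endian size, the hypothetical helper names `append_record` and `read_records`, and `std::string` payloads standing in for the output of `SerializeToString`.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Appends one record to the file: a 4-byte little-endian size, then the bytes.
bool append_record(const std::string& path, const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // size must fit the 4-byte prefix
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::app);
    if (!out) return false;
    const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
    char prefix[4];
    for (int i = 0; i < 4; ++i) {
        prefix[i] = static_cast<char>((size >> (8 * i)) & 0xFF);
    }
    out.write(prefix, 4);
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Reads every complete record; stops at end of file or at a truncated record.
// Note: a corrupt size prefix would make this allocate a large buffer, so
// production code should cap the size it is willing to accept.
std::vector<std::string> read_records(const std::string& path) {
    std::vector<std::string> records;
    std::ifstream in(path.c_str(), std::ios::binary);
    char prefix[4];
    while (in.read(prefix, 4)) {
        std::uint32_t size = 0;
        for (int i = 0; i < 4; ++i) {
            size |= static_cast<std::uint32_t>(
                        static_cast<unsigned char>(prefix[i])) << (8 * i);
        }
        std::string buffer(size, '\0');  // the "separate buffer" from the docs
        if (size > 0 && !in.read(&buffer[0], static_cast<std::streamsize>(size))) {
            break;  // truncated final record
        }
        records.push_back(buffer);  // real code would parse the message here
    }
    return records;
}
```

Because every record carries its own size, appending is just another `append_record` call on the same path; nothing already in the file has to be rewritten.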

DD.
6

If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls that will serialize/deserialize a sequence of proto messages to/from a file in a way that's compatible with the Java version of these calls. Unfortunately this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.

Another option: Google has open-sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter, which serialize/deserialize a proto stream to/from a file.

If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio

Reading and writing with these classes is straightforward:

    // Open the log file and wrap it in a RecordWriter.
    File* file = File::Open("proto.log", "w");
    RecordWriter writer(file);
    // Write each message as a separate record.
    writer.WriteProtocolMessage(msg1);
    writer.WriteProtocolMessage(msg2);
    ...
    writer.Close();
moof2k
6

Protobuf does not include a terminator per outermost record, so you need to delimit the records yourself. The simplest approach is to prefix the data with the length of the record that follows. Personally, I tend to write a string header (for an arbitrary field number), then the length as a "varint". This means the entire document is itself a valid protobuf and could be consumed as an object with a "repeated" element. However, a fixed-length marker (typically 32-bit little-endian) would do just as well. With any such storage, the file is appendable, as you require.
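
As a rough illustration of the header-plus-varint layout, here is a C++ sketch that builds and walks such a file in memory without using the protobuf library. The helper names (`make_tag`, `append_varint`, `parse_varint`, `append_tagged_record`, `read_tagged_record`) are hypothetical, payloads are plain `std::string` values standing in for serialized messages, and the header is computed as the field number shifted left by 3, OR-ed with wire type 2 (length-delimited), then varint-encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const std::uint32_t kWireTypeLengthDelimited = 2;

// Field header: (field_number << 3) | wire_type. Field 1 gives 0x0A.
std::uint32_t make_tag(std::uint32_t field_number) {
    assert(field_number >= 1 && field_number <= 0x1FFFFFFF);
    return (field_number << 3) | kWireTypeLengthDelimited;
}

// Base-128 varint, low 7-bit group first.
void append_varint(std::string& out, std::uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Decodes a varint at data[pos], advancing pos; false if truncated or too long.
bool parse_varint(const std::string& data, std::size_t& pos, std::uint32_t& value) {
    value = 0;
    for (int shift = 0; shift < 35 && pos < data.size(); shift += 7) {
        const unsigned char byte = static_cast<unsigned char>(data[pos++]);
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// One record: header, varint length, payload bytes.
void append_tagged_record(std::string& out, std::uint32_t field_number,
                          const std::string& payload) {
    append_varint(out, make_tag(field_number));
    append_varint(out, static_cast<std::uint32_t>(payload.size()));
    out.append(payload);
}

// Reads the record starting at pos; false at end of data or on malformed input.
bool read_tagged_record(const std::string& data, std::size_t& pos,
                        std::uint32_t& field_number, std::string& payload) {
    std::uint32_t tag = 0;
    std::uint32_t size = 0;
    if (!parse_varint(data, pos, tag)) return false;
    if ((tag & 7) != kWireTypeLengthDelimited) return false;
    if (!parse_varint(data, pos, size)) return false;
    if (size > data.size() - pos) return false;  // truncated payload
    field_number = tag >> 3;
    payload.assign(data, pos, size);
    pos += size;
    return true;
}
```

The extra header byte per record is what lets the whole file double as a single message with one repeated field; the fixed-length marker variant skips that byte but gives up this property.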

Marc Gravell
  • 1
    After all this I found that CSV ended up being smaller! Probably because most of the time my numbers fit into single characters. – DD. Feb 05 '12 at 15:00
  • Hi Marc, can you elaborate on the 'string-header' idea ? – AntonioD Dec 14 '12 at 04:48
  • 1
    @AntonioD by "string-header", I mean the "length-delimited" encoding (wire-type 2), as used by strings and other sub-data. Basically: pick your arbitrary field-number, left-shift it by 3, "or" it with 2 (the wire-type), and varint-encode the result (this is the standard process for representing a field-header in protobuf). So if your arbitrary field is `1`, you just prepend with 10 / 0x0A. – Marc Gravell Dec 14 '12 at 07:34
  • @Perrot not really, no - I'm not a Java person – Marc Gravell Jan 15 '13 at 15:31
  • I was in the process of writing an example to replace my comment. And you replied at that moment ;) – Guillaume Perrot Jan 17 '13 at 15:04
  • 5
    To write several protobuf messages to a stream/file, wrap your output stream in a CodedOutputStream: `CodedOutputStream writer = CodedOutputStream.newInstance(outputStreamToWrite); writer.writeRawVarint32(bytes.length); writer.writeRawBytes(bytes);` To read the entire file: `CodedInputStream is = CodedInputStream.newInstance(inputStreamToWrap); while (!is.isAtEnd()) {int size = is.readRawVarint32(); YourMessage.parseFrom(is.readRawBytes(size));}` – Guillaume Perrot Jan 17 '13 at 15:07
-7

An easier way is to base64-encode each message and store one record per line.
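
A small C++ sketch of that idea, using only the standard library. The helper names (`base64_encode`, `base64_decode`, `append_line`, `read_lines`) are hypothetical, and payloads are `std::string` values standing in for serialized messages. Base64 output never contains a newline, so `'\n'` is a safe separator, at the cost of roughly a third more bytes than the raw message.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 with '=' padding.
std::string base64_encode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); i += 3) {
        const std::size_t left = in.size() - i;  // bytes available for this group
        unsigned group = static_cast<unsigned char>(in[i]) << 16;
        if (left > 1) group |= static_cast<unsigned char>(in[i + 1]) << 8;
        if (left > 2) group |= static_cast<unsigned char>(in[i + 2]);
        out += kAlphabet[(group >> 18) & 63];
        out += kAlphabet[(group >> 12) & 63];
        out += left > 1 ? kAlphabet[(group >> 6) & 63] : '=';
        out += left > 2 ? kAlphabet[group & 63] : '=';
    }
    assert(out.size() == 4 * ((in.size() + 2) / 3));
    return out;
}

// Returns false if the line contains a character outside the alphabet.
bool base64_decode(const std::string& in, std::string& out) {
    out.clear();
    unsigned buffer = 0;
    int bits = 0;
    for (std::size_t i = 0; i < in.size() && in[i] != '='; ++i) {
        const char* p = in[i] ? std::strchr(kAlphabet, in[i]) : 0;
        if (!p) return false;
        buffer = (buffer << 6) | static_cast<unsigned>(p - kAlphabet);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return true;
}

// One record per line.
void append_line(std::ostream& out, const std::string& payload) {
    out << base64_encode(payload) << '\n';
}

// Decodes every line; lines that fail to decode are skipped.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> payloads;
    std::string line;
    std::string payload;
    while (std::getline(in, line)) {
        if (base64_decode(line, payload)) payloads.push_back(payload);
    }
    return payloads;
}
```

The same functions work with a `std::fstream` opened in append mode or with a `std::stringstream`.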

creatiwit