
I am trying to parse delimited protobuf messages (from a file) in C++ using the implementation of readDelimitedFrom() copied below:

bool readDelimitedFrom(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    google::protobuf::MessageLite* message) {
  // We create a new coded stream for each message.  Don't worry, this is fast,
  // and it makes sure the 64MB total size limit is imposed per-message rather
  // than on the whole stream.  (See the CodedInputStream interface for more
  // info on this limit.)
  google::protobuf::io::CodedInputStream input(rawInput);

  // Read the size.
  uint32_t size;
  if (!input.ReadVarint32(&size)) return false;

  // Tell the stream not to read beyond that size.
  google::protobuf::io::CodedInputStream::Limit limit =
      input.PushLimit(size);

  // Parse the message.
  if (!message->MergeFromCodedStream(&input)) return false;
  if (!input.ConsumedEntireMessage()) return false;

  // Release the limit.
  input.PopLimit(limit);

  return true;
}

My issue is that I need to group messages and process them in batches based on a uint32_t field contained in each message (call it id).

Currently, I have the following code in my main loop:

...
int infd = -1;
_sopen_s(&infd, argv[1], _O_RDONLY | _O_BINARY, _SH_DENYWR, _S_IREAD);

google::protobuf::io::ZeroCopyInputStream *input = 
    new google::protobuf::io::FileInputStream(infd);

std::vector<ProtoMessage> msgList;
bool readMore = true;

do {
    ProtoMessage msg;
    readMore = readNextMessage(input, msg, msgList);

    if (!msgList.empty()) {
        std::cout << "Processing Message Batch - ID: " << msgList[0].id() << std::endl;
        /* some processing done here */
        msgList.clear();    // start the next batch from an empty list
    }
} while (readMore);

The implementation of readNextMessage() is as follows:

bool readNextMessage(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    ProtoMessage& nextMsg,
    std::vector<ProtoMessage>& batchList) {

    bool sameBatch = false;
    uint32_t msgID = 0;
    do {
        nextMsg.Clear();    // MergeFromCodedStream() merges into existing fields
        if (!readDelimitedFrom(rawInput, &nextMsg))
            return false;   // end of stream or parse error
        if (msgID == 0)
            msgID = nextMsg.id();    // id is guaranteed to be non-zero
        sameBatch = (msgID == nextMsg.id());
        if (sameBatch)
            batchList.push_back(nextMsg);
    } while (sameBatch);

    // need a way to roll-back here as nextMsg is now the first new
    // ProtoMessage belonging to a new batch.

    return true;
}

The logic of this function is fairly simple: take a ZeroCopyInputStream and parse it with readDelimitedFrom(), grouping ProtoMessage messages into a vector by their id field. When it encounters a message with a new id, it stops and returns control to main so the batch can be processed.

The undesired consequence is that the function has to consume the first message of the next batch (including its Varint32-encoded size) with no way to back up the stream. I would like to be able to point the ZeroCopyInputStream back to its position before the last readDelimitedFrom() call.

Is there any way for me to modify readDelimitedFrom() to also return the number of bytes consumed during its call, and then use pointer arithmetic on the ZeroCopyInputStream to achieve the desired functionality?

The provided function ZeroCopyInputStream::BackUp() has a precondition that ZeroCopyInputStream::Next() be the last method called on the stream. That is not the case when the CodedInputStream wrapper is used to parse delimited messages.

KnightsValour

1 Answer


ZeroCopyInputStream::BackUp() can only back up over the last buffer received. A single message may span multiple buffers, so there is no general way to do what you want with the ZeroCopyInputStream interface.

Some options:

  • Call rawInput->ByteCount() before parsing each message, in order to determine exactly the byte position where the message started. If you need to roll back, seek the underlying file backwards and recreate the ZeroCopyInputStream on top of it. This only works if you are reading from a file, of course.
  • When you encounter a message in a new batch, store it off to the side, and then bring it back out when the caller asks to start reading the next batch.
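
The second option might look roughly like the following. This is only a minimal sketch under stated assumptions: Msg is a hypothetical stand-in for ProtoMessage, and BatchReader, readOne and nextBatch are illustrative names that are not part of protobuf. The readOne callback is assumed to wrap readDelimitedFrom().

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for ProtoMessage; only the id field matters here.
struct Msg {
    std::uint32_t id;
    int payload;
};

// Groups consecutive messages with the same id into batches. The first
// message of the next batch is kept in pending_ rather than being pushed
// back onto the stream.
class BatchReader {
public:
    // readOne fills its argument and returns false when no message could
    // be read (assumed to wrap readDelimitedFrom(rawInput, &msg)).
    explicit BatchReader(std::function<bool(Msg&)> readOne)
        : readOne_(std::move(readOne)) {}

    // Fills batch with the next run of same-id messages.
    // Returns false once no messages remain.
    bool nextBatch(std::vector<Msg>& batch) {
        batch.clear();
        if (hasPending_) {              // start with the stashed message
            batch.push_back(pending_);
            hasPending_ = false;
        }
        Msg msg{};
        while (readOne_(msg)) {
            if (!batch.empty() && msg.id != batch.front().id) {
                pending_ = msg;         // belongs to the next batch
                hasPending_ = true;
                return true;
            }
            batch.push_back(msg);
        }
        return !batch.empty();          // final batch, or nothing left
    }

private:
    std::function<bool(Msg&)> readOne_;
    Msg pending_{};
    bool hasPending_ = false;
};
```

With real messages, readOne could be a lambda that calls msg.Clear() and then readDelimitedFrom(rawInput, &msg), since MergeFromCodedStream() merges into whatever the message already holds. The caller then loops on nextBatch() and never needs to rewind the stream.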
Kenton Varda
  • Thanks Kenton - I figured this was the case. I'm currently doing the second option you mentioned. So I guess whatever a `ZeroCopyInputStream` has read in the past is not accessible anymore? As in, we need to seek its underlying file and re-create as opposed to dealing with the same `ZeroCopyInputStream`? – KnightsValour Sep 25 '15 at 15:31
  • @KnightsValour Yes, basically. – Kenton Varda Sep 30 '15 at 00:40
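
The seek-and-recreate idea can be illustrated with the standard library alone. The sketch below is an analogy, not protobuf code: BlockReader is a hypothetical stand-in for a buffering stream such as FileInputStream, its byteCount() plays the role of ByteCount(), and records use a 1-byte length prefix as a simplification of the varint framing. The names readRecord and rewindTo are likewise made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this out
#include <string>

// Reads ahead in blocks, so the underlying stream's position runs ahead of
// what the caller has actually consumed.
class BlockReader {
public:
    explicit BlockReader(std::istream& in, std::size_t blockSize = 8)
        : in_(in), blockSize_(blockSize) {}

    // Reads one byte; returns false at end of input.
    bool readByte(char& out) {
        if (pos_ == buf_.size() && !fill()) return false;
        out = buf_[pos_++];
        ++consumed_;
        return true;
    }

    // Bytes consumed through this reader since it was constructed.
    std::int64_t byteCount() const { return consumed_; }

private:
    // Refills the read-ahead buffer; returns false if nothing was read.
    bool fill() {
        buf_.resize(blockSize_);
        in_.read(&buf_[0], static_cast<std::streamsize>(blockSize_));
        buf_.resize(static_cast<std::size_t>(in_.gcount()));
        pos_ = 0;
        return !buf_.empty();
    }

    std::istream& in_;
    std::size_t blockSize_;
    std::string buf_;
    std::size_t pos_ = 0;
    std::int64_t consumed_ = 0;
};

// Reads one record: a 1-byte length followed by that many payload bytes.
bool readRecord(BlockReader& r, std::string& out) {
    char len;
    if (!r.readByte(len)) return false;
    out.clear();
    for (int i = 0; i < static_cast<unsigned char>(len); ++i) {
        char c;
        if (!r.readByte(c)) return false;
        out.push_back(c);
    }
    return true;
}

// Rolls back by seeking the underlying stream to an absolute offset and
// building a fresh reader; the old reader's buffered bytes are abandoned.
BlockReader rewindTo(std::istream& in, std::int64_t absoluteOffset) {
    in.clear();                 // reset eof/fail bits left by read-ahead
    in.seekg(absoluteOffset);
    return BlockReader(in);
}
```

In this sketch a fresh reader's byteCount() starts again at zero, so a caller that rewinds more than once would need to track the absolute offset itself (offset at construction plus byteCount()). With protobuf, the corresponding steps would be seeking the file descriptor and constructing a new FileInputStream on it.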