I am trying to parse delimited protobuf messages (from a file) in C++ using the following implementation of readDelimitedFrom(), copied below:
bool readDelimitedFrom(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    google::protobuf::MessageLite* message) {
  // We create a new coded stream for each message. Don't worry, this is fast,
  // and it makes sure the 64MB total size limit is imposed per-message rather
  // than on the whole stream. (See the CodedInputStream interface for more
  // info on this limit.)
  google::protobuf::io::CodedInputStream input(rawInput);

  // Read the size.
  uint32_t size;
  if (!input.ReadVarint32(&size)) return false;

  // Tell the stream not to read beyond that size.
  google::protobuf::io::CodedInputStream::Limit limit =
      input.PushLimit(size);

  // Parse the message.
  if (!message->MergeFromCodedStream(&input)) return false;
  if (!input.ConsumedEntireMessage()) return false;

  // Release the limit.
  input.PopLimit(limit);

  return true;
}
My issue is that I need to group messages and process them in batches based on a uint32_t field contained within each message - let's call it id.
Currently, I have the following code in my main loop:
...
int infd = -1;
_sopen_s(&infd, argv[1], _O_RDONLY | _O_BINARY, _SH_DENYWR, _S_IREAD);

google::protobuf::io::ZeroCopyInputStream *input =
    new google::protobuf::io::FileInputStream(infd);

std::vector<ProtoMessage> msgList;
bool readMore = true;

do {
  ProtoMessage msg;
  readMore = readNextMessage(input, msg, msgList);

  if (!msgList.empty()) {
    std::cout << "Processing Message Batch - ID: " << msgList[0].id();
    /* some processing done here */
  }
} while (readMore);
The implementation of readNextMessage() is as follows:
bool readNextMessage(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    ProtoMessage& nextMsg,
    std::vector<ProtoMessage>& batchList) {
  bool sameBatch = false;
  uint32_t msgID = 0;

  do {
    if (!readDelimitedFrom(rawInput, &nextMsg))
      return false;

    if (msgID == 0)
      msgID = nextMsg.id(); // id() is guaranteed to be non-zero

    if (sameBatch = (msgID == nextMsg.id()))
      batchList.push_back(nextMsg);
  } while (sameBatch);

  // need a way to roll back here, as nextMsg is now the first
  // ProtoMessage belonging to a new batch.
  return true;
}
The logic of this function is fairly simple: take a ZeroCopyInputStream, parse it using readDelimitedFrom(), and group the resulting ProtoMessage objects into a vector based on their id field. If it encounters a message with a new id, it stops and returns control to main so the current batch can be processed.
This leads to the undesired requirement of having to consume/read the first message (including its Varint32-encoded size prefix) that does not belong to the previous batch, without having any way to back the stream up. I would like to be able to point the ZeroCopyInputStream back to the position it was at before the last call to readDelimitedFrom().

Is there any way for me to modify readDelimitedFrom() to also return the number of bytes consumed during its call, and then use pointer arithmetic on the ZeroCopyInputStream to achieve the desired functionality?
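For concreteness, the kind of modification I have in mind is sketched below. It is untested; the bytesConsumed out-parameter is my own addition, and I am relying on CodedInputStream::CurrentPosition() to count the bytes read by this particular call:

// Untested sketch: same as readDelimitedFrom() above, but also reports
// how many bytes (size prefix + payload) this call consumed.
bool readDelimitedFromCounting(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    google::protobuf::MessageLite* message,
    int* bytesConsumed /* hypothetical out-parameter */) {
  google::protobuf::io::CodedInputStream input(rawInput);

  // Read the size prefix.
  uint32_t size;
  if (!input.ReadVarint32(&size)) return false;

  // Parse the message, limited to 'size' bytes.
  google::protobuf::io::CodedInputStream::Limit limit = input.PushLimit(size);
  if (!message->MergeFromCodedStream(&input)) return false;
  if (!input.ConsumedEntireMessage()) return false;
  input.PopLimit(limit);

  // CurrentPosition() is relative to where this CodedInputStream started,
  // so here it should equal the number of bytes consumed by this call.
  if (bytesConsumed != nullptr)
    *bytesConsumed = input.CurrentPosition();

  return true;
}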
The provided function ZeroCopyInputStream::BackUp() has a precondition that ZeroCopyInputStream::Next() be the last method called on the stream. Obviously, this is not the case when the stream is wrapped in a CodedInputStream to parse delimited messages.
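In other words, what I would like to be able to write after readNextMessage() over-reads is something like the following, which I don't believe is legal given the precondition above:

int consumed = 0;
readDelimitedFromCounting(rawInput, &nextMsg, &consumed);  // hypothetical variant sketched earlier
// ... discover that nextMsg starts a new batch ...
rawInput->BackUp(consumed);  // not allowed: Next() was not the last call on rawInput,
                             // and BackUp() can only back up within the last returned buffer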