Protobuf lazy decoding of sub message

Question

I am using proto3 (Java) in my projects. I have some huge protobufs with smaller messages embedded in them. Is there a way to achieve partial decoding of only the few nested sub-messages I want to look at? The issue is that I need to join this huge proto-based record data with other records, but my joins are based on very small sub-messages. I don't want to decode the entire huge protobuf up front; I want to decode only the nested message (a string id) for the join, and then decode the entire protobuf only for the joined data.

I tried the [lazy=true] field option, but I don't see any difference in the generated code. I also benchmarked deserialization time with and without the lazy keyword, and it didn't seem to have any effect. Is this feature on by default for all fields? Or is this even possible? I do see a few classes such as LazyFields.java and test cases in the protobuf GitHub repository, so I assume this feature has been implemented.

Is the issue deserializing lots of unnecessary objects? Or is it deserializing lots of unnecessary fields in necessary objects? The latter is fixable - the former is not — Marc Gravell, Sep 10 '17 at 00:10
@MarcGravell: Can you please elaborate a little more with an example? From what I understand, my issue is the latter case, i.e. being able to decode only specific nested fields/sub-messages instead of all the fields, or in other words lazily decoding fields. — user179156, Sep 10 '17 at 01:22
@MarcGravell: To clarify a bit more with an example: say I have a few hundred million giant proto objects, and they all have a small nested message that is an identifier for the object. I need to filter out the few proto objects that match my small list of identifiers, so for each proto object, instead of deserializing the entire huge proto, which has maybe hundreds of fields/nested messages, I only want to decode/deserialize the small identifier sub-message. — user179156, Sep 10 '17 at 01:28
if you create a second `message` that *only has* the fields you want (and so on and so on with nested messages), then that *should* do most of what you describe, especially in proto3 where most libs don't store unexpected data for round-trip — Marc Gravell, Sep 10 '17 at 14:03
@MarcGravell I am not sure what exactly your solution is, or whether it even works. Why should I create a second message? I do need all of the content of the message; I just need to decode only some fields first, and if they are what I need, decode further to get the rest of the content. A second message doesn't work and seems like a pretty hacky workaround. — user179156, Sep 12 '17 at 03:52
protobuf doesn't care what things are *called* - it only cares about the *shape* of the data. If you have a message with 27 fields and you only want to decode 3 of them, creating a separate message type with *just those fields* (as the same field numbers, obviously) will decode *just those fields* - *as long as* your library isn't storing the other data for round trip, which proto3 doesn't usually do. — Marc Gravell, Sep 12 '17 at 08:29
@MarcGravell: I am not sure what you mean by creating another message; I don't think I even need to create a separate message. Can you write a small code snippet to explain? Why does the solution you present have anything to do with "round trip"? All I need is to decode a few fields within the protobuf (let's forget about round-trip data). I have a huge proto message that I don't want to decode in full. I just want to decode it lazily: decode some small sub-message, and if that small message meets my criteria, decode the entire large message. — user179156, Sep 12 '17 at 19:26
@MarcGravell: Your statement "creating a separate message type with just those fields (as the same field numbers, obviously) will decode just those fields" is completely wrong. If it depends on what data is stored in the proto, then I don't think it is a solution at all. If I have to pass in the data I need, there is no point in creating a second message; I could just pass a smaller message, but that is not a solution. — user179156, Sep 12 '17 at 19:28

score 0 · Answer 1 · answered Dec 11 '20 at 17:34

For those who happen to look at this conversation later and find it hard to understand, here's what Marc is talking about:

If your object is something like

message MyBigMessage {
  string id = 1;
  int32 sourceType = 2;
  // ... and many other fields here that would be expensive to parse ...
}

Say you get a block of bytes that you have to parse, but you only want to parse messages from a certain source, and maybe ones that match a certain id range. You could first parse those bytes with another message type, such as:

message MyFilterMessage {
  string id = 1;         // has to be 1 to match
  int32 sourceType = 2;  // has to be 2 to match
  // ... and NOTHING ELSE here ...
}

Then you could look at sourceType and id. If they match whatever you are filtering for, you could parse the same bytes again, this time using MyBigMessage to parse the whole thing.
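The reason a cheap first pass is possible at all is that every field on the wire is prefixed with its field number and wire type, so a reader can skip fields it does not care about. As a rough illustration of that idea (not the filter-message approach itself, and not part of the protobuf library), here is a hand-rolled sketch in plain Java that scans the top-level fields of a serialized message and pulls out one string field. The names WireScan, readVarint and findStringField are made up for this example, and it assumes the id is a top-level string field with a known field number.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of protobuf-java: scans the top-level fields
// of a serialized message and returns one string field, skipping the rest.
final class WireScan {

    // Reads a base-128 varint starting at pos[0] and advances pos[0] past it.
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    // Returns the first occurrence of the given length-delimited field decoded
    // as UTF-8, or null if it is absent. (A real parser keeps the last
    // occurrence of a singular field; this sketch stops at the first one.)
    static String findStringField(byte[] buf, int wantedField) {
        int[] pos = {0};
        while (pos[0] < buf.length) {
            long tag = readVarint(buf, pos);
            int field = (int) (tag >>> 3);   // upper bits: field number
            int wireType = (int) (tag & 7);  // lower 3 bits: wire type
            switch (wireType) {
                case 0:  // varint: read and discard
                    readVarint(buf, pos);
                    break;
                case 1:  // fixed 64-bit
                    pos[0] += 8;
                    break;
                case 2: {  // length-delimited: string, bytes, sub-message
                    int len = (int) readVarint(buf, pos);
                    if (field == wantedField) {
                        return new String(buf, pos[0], len, StandardCharsets.UTF_8);
                    }
                    pos[0] += len;  // skip the payload without decoding it
                    break;
                }
                case 5:  // fixed 32-bit
                    pos[0] += 4;
                    break;
                default:  // groups (3, 4) are not handled in this sketch
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return null;
    }
}
```

With something like this, WireScan.findStringField(bytes, 1) would give the id, and only the records whose id matches would then be handed to the generated MyBigMessage parser. In practice the filter-message approach above is usually simpler, since the generated parser already does this skipping for you.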

One other thing to know: as of 2017, lazy parsing was disabled in Java (except for MessageSet), according to this post: https://github.com/protocolbuffers/protobuf/issues/3601#issuecomment-341516826 I don't know the current status. Too lazy to try to find out! :-)
