Google ProtoBuf serialization / deserialization

Question

I am reading about Google Protocol Buffers. Can I serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields?

More generally, I want to send objects from any language to a Java server and deserialize them there.

Assume the following is my .proto file:

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

I ran protoc on this and generated a C++ class. Now I want to send the serialized stream to the Java server.

On the Java side, can I deserialize the stream so that I can find out that it contains 3 fields, along with each field's name, type, and value?

Compile the same `.proto` file and generate Java code. Then use that when deserializing. — maba, Jun 13 '13 at 08:14
Oh, so that means that since I am using this in a client-server model, I will need to have the .proto file on the server side as well. Right? — Avinash, Jun 13 '13 at 08:16
He needs to use the `--java_out` flag to make `protoc` produce Java code, IIRC — Anya Shenanigans, Jun 13 '13 at 08:16
@Avinash: yes, all your applications using the protobuf have to know the `.proto` file. See my answer. — Matthieu Rouget, Jun 13 '13 at 08:19

score 2 · Accepted Answer · answered Jun 13 '13 at 08:30

On the Java side, can I deserialize the stream so that I can find out that it contains 3 fields, along with each field's name, type, and value?

You will need to know the schema in advance. Firstly, protobuf does not transmit names; the only identifier it uses is the numeric key (1, 2 and 3 in your example) of each field. Secondly, it does not explicitly specify the type; there are only a few wire types in protobuf (varint, 32-bit, 64-bit, length-prefix, group); actual data types are mapped onto those, but you cannot unambiguously decode data without the schema:

- varint is "some form of integer", but could be signed, unsigned or "zigzag" (which allows negative numbers of small magnitude to be cheaply encoded), and could be intended to represent any width of data (64 bit, 32 bit, etc)
- 32-bit could be an integer, but could be signed or unsigned - or it could be a 32-bit floating-point number
- 64-bit could be an integer, but could be signed or unsigned - or it could be a 64-bit floating-point number
- length-prefix could be a UTF-8 string, a sequence of raw bytes (without any particular meaning), a "packed" set of repeated values of some primitive type (integer, floating point, etc), or could be a structured sub-message in protobuf format
- groups - hoorah! this is always unambiguous! this can only mean one thing; but that one thing is largely deprecated by Google :(
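To make the ambiguity concrete, here is a minimal Python sketch that uses only the standard library, not the protobuf package. The helper names (`decode_varint`, `as_signed64`, `as_zigzag`) and the sample bytes are made up for illustration. It shows the same bytes yielding different values depending on which declared type you assume.

```python
import struct


def decode_varint(data, pos=0):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear: last byte
            return result, pos
        shift += 7


def as_signed64(raw):
    """Read the raw value as two's-complement 64-bit (how int32/int64 are sent)."""
    return raw - (1 << 64) if raw >= (1 << 63) else raw


def as_zigzag(raw):
    """Read the raw value as zigzag-encoded (how sint32/sint64 are sent)."""
    return (raw >> 1) ^ -(raw & 1)


# One varint byte, three plausible readings.
raw, _ = decode_varint(bytes([0x03]))
interpretations = {"uint": raw, "int": as_signed64(raw), "sint": as_zigzag(raw)}

# Four bytes of a 32-bit field: an integer or a float, the wire does not say.
fixed32 = bytes.fromhex("0000803f")
as_int = struct.unpack("<i", fixed32)[0]
as_float = struct.unpack("<f", fixed32)[0]
```

Without the schema there is no way to tell which of these readings the sender intended.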

So fundamentally: you need the schema. The encoded data does not include what you want. This is deliberate, to save space: if the protocol assumes that the encoder and decoder both know what the message is meant to look like, then a lot less information needs to be sent.
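As a rough illustration of how little goes on the wire, the following Python sketch hand-encodes the `Person` message from the question using only the standard library. The function names and the sample values (id 42, name "Ann", email "a@b.c") are assumptions for the example; real code would use the classes generated by protoc.

```python
VARINT, LENGTH_PREFIXED = 0, 2  # the two wire types Person needs


def encode_varint(value):
    """Encode an integer as a base-128 varint."""
    value &= 0xFFFFFFFFFFFFFFFF  # negative values are sent as 64-bit two's complement
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number, wire_type):
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number, text):
    raw = text.encode("utf-8")
    return tag(field_number, LENGTH_PREFIXED) + encode_varint(len(raw)) + raw


def encode_person(person_id, name, email=None):
    out = tag(1, VARINT) + encode_varint(person_id)
    out += encode_string_field(2, name)
    if email is not None:  # optional field: simply omitted when unset
        out += encode_string_field(3, email)
    return out


encoded = encode_person(42, "Ann", "a@b.c")
print(encoded.hex())  # 082a1203416e6e1a056140622e63
```

The 14 bytes contain the field numbers and the values, but nothing that says "id", "name", "email", "int32" or "string".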

Note, however, that the information that is included is enough to safely round-trip a message even if there are fields that are not expected; it is not necessary to know the name or type if you only need to re-encode it to pass it along / back.
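A small Python sketch of that round-trip idea, again standard library only and with made-up helper names: an intermediary that only understands field 1 rewrites it and copies every other field through byte for byte, without knowing what those fields mean. Groups are left out to keep the sketch short.

```python
def read_varint(data, pos):
    """Decode one varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)


def split_fields(data):
    """Return [(field_number, raw_bytes)], raw_bytes including the field key."""
    pos, fields = 0, []
    while pos < len(data):
        start = pos
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 7
        if wire_type == 0:      # varint
            _, pos = read_varint(data, pos)
        elif wire_type == 1:    # 64-bit
            pos += 8
        elif wire_type == 2:    # length-prefixed
            length, pos = read_varint(data, pos)
            pos += length
        elif wire_type == 5:    # 32-bit
            pos += 4
        else:
            raise ValueError("groups are not handled in this sketch")
        if pos > len(data):
            raise ValueError("truncated message")
        fields.append((field_number, data[start:pos]))
    return fields


def replace_id(data, new_id):
    """Rewrite field 1 (assumed to be a varint); pass all other fields through."""
    out = bytearray()
    for field_number, raw in split_fields(data):
        out += b"\x08" + encode_varint(new_id) if field_number == 1 else raw
    return bytes(out)


original = bytes.fromhex("082a1203416e6e1a056140622e63")  # sample Person bytes
forwarded = replace_id(original, 7)
```

Fields 2 and 3 survive untouched even though this code never learns that they are strings.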

What you can do is use the parser API to scan over the data to reveal that there are three fields: field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed. You could make educated guesses about the data beyond that (for example, you could see whether a UTF-8 decode produces something that looks roughly text-like, and verify that UTF-8 encoding the result gives you back the original bytes; if it does, it is possible it is a string).
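The scan described above can be sketched in plain Python without the protobuf package. The names `scan` and `looks_like_text` are hypothetical, and the sample bytes are a hand-encoded `Person` with made-up values. In Java, `CodedInputStream.readTag()` together with `WireFormat.getTagFieldNumber` and `WireFormat.getTagWireType` should let you write the same loop, and `protoc --decode_raw` prints a similar schema-less dump; treat those pointers as a starting point to check against the API docs rather than as tested code.

```python
WIRE_TYPE_NAMES = {0: "varint", 1: "64-bit", 2: "length-prefixed", 5: "32-bit"}


def read_varint(data, pos):
    """Decode one varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan(data):
    """List (field_number, wire_type_name, raw_value) for each top-level field."""
    pos, found = 0, []
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = read_varint(data, pos)
        elif wire_type == 1:
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError("groups are not handled in this sketch")
        found.append((field_number, WIRE_TYPE_NAMES[wire_type], value))
    return found


def looks_like_text(raw):
    """Educated guess only: valid UTF-8 that is printable and round-trips."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects malformed input
    except UnicodeDecodeError:
        return False
    return text.isprintable() and text.encode("utf-8") == raw


sample = bytes.fromhex("082a1203416e6e1a056140622e63")
for number, kind, value in scan(sample):
    guess = " (maybe a string)" if kind == "length-prefixed" and looks_like_text(value) else ""
    print(number, kind, value, guess)
```

Note that the guess can be wrong in both directions: a packed list of small integers or a nested message can happen to be valid printable UTF-8.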

Hmm, thanks Marc. I guess Apache Avro makes more sense, as I do not want to have the schema on the server side, and Apache Avro says "When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present." — Avinash, Jun 13 '13 at 08:44
@Avinash one footnote to that; a protobuf .proto schema can be packed and transmitted as a protobuf message; but, this doesn't happen by default and there are no particular APIs for making this transparent — Marc Gravell, Jun 13 '13 at 10:03
Thanks Marc. But even if I transmit the schema, I will still have to generate Java code on the server side to parse the serialized message, right? — Avinash, Jun 13 '13 at 10:20
No, @Avinash, you can use a DynamicMessage to inspect the message being sent (you still need to send the schema, though). It may well be easier in Avro, as it includes the schema in the message — Bruce Martin, Jun 13 '13 at 11:26
*“What you can do is use the parser API to scan over the data to reveal that there are three fields, field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed.”* Any example/documentation/discussion/pointer about how to do this? There are very limited resources online talking about this (maybe it is against the design goal of Protocol Buffers?), but this is exactly what I need. Is it [`CodedInputStream`](https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/CodedInputStream)? — Franklin Yu, Feb 12 '18 at 20:34

score 1 · Answer 2 · answered Jun 13 '13 at 08:15

Can I serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields?

Yes, it is the very goal of protobuf.

You can serialize data in an application developed in any supported language and deserialize it in an application developed in any supported language. The serializing and deserializing languages can be the same or different.

Keep in mind that protocol buffers are not self-describing, so both sides of your application need to have serializers/deserializers generated from the .proto file.

Matthieu Rouget

They are by default non-describing but it is trivial to wrap a message inside another message that provides the self-documentation. https://developers.google.com/protocol-buffers/docs/techniques#self-description – g19fanatic Jun 14 '13 at 13:09
@g19fanatic: Never needed to have self-describing messages, but it is interesting. – Matthieu Rouget Jun 14 '13 at 14:00

score 1 · Answer 3 · edited May 23 '17 at 11:57

In short: yes you can.

You will need to create .proto files which define the data structures that you want to share. By using the Google Protocol Buffers compiler you can then generate interfaces and (de)serialization code for your structures for both Java and C++ (and almost any other language you can think of).

To transfer your data over the wire you can use, for instance, ZeroMQ, an extremely versatile communications framework that also offers APIs for many languages, among them Java and C++.

See this question for more details.
