Re-encode a protobuf with protoc and no schema

Question

I'm playing with an app and trying to reverse engineer data files that it can export and import. The files are protobufs in binary. My goal is to be able to export a file, convert it to text, modify it with additional data records, re-encode it to binary, and reimport it as a way to bypass tedious manual input of data into the app. I have used a protoc binary on my Windows machine with --decode_raw and can produce nicely readable hierarchical data without knowing the actual .proto schema used. Using Marc Gravell's parser gives similar results (with some ambiguities I don't quite understand). My questions are the following:

1. Is there an easy way to re-encode the output of --decode_raw to produce the original binary, either using protoc or another tool? I understand that the raw decode is making assumptions about the unknown schema, and so far it looks like those assumptions work ok to make intelligible results. Is there a loss of data on the raw decode that would prevent re-encoding to the original? Is it just that the protoc developers didn't see a need to have this feature? With this capability, I could modify the text and re-encode, and have a decent chance of generating a valid binary.
2. If #1 is no go, given the raw output how do I create a .proto file and text message input file to re-encode the original binary using protoc --encode? I would appreciate a pointer to sample text files that could be used as command line input to protoc for me to play with to learn the needed syntax. The sample stuff I've seen all appears geared towards using protoc to generate source code. The binary protobufs I tested have decoded to strings, ints and a few hex values (which I still need to decipher) which correspond well to the data visible in the app, so I have confidence that I can make the required schema if I see working examples.

Some preferences: I'm tinkering on my phone and my Windows laptop, and would rather not need to install Python or another programming platform. I'd just like to use protoc on the command line, and my text/hex editor.

Thanks for any help.

[Edit: I've located a web page that gives sample input, which gave me the clues I needed to make some progress. The page is https://medium.com/@at_ishikawa/cli-to-generate-protocol-buffers-c2cfdf633dce, so thanks to @at_ishikawa for taking the time to make it. With the examples, I understand how to format a message file to generate a binary. However, it looks like the binary I'm trying to decode may not be amenable to the command line. See my new question below.]

New question: I still have the goal of decoding the binary into a text message, editing the text message to add more data records, and re-encoding the modified text message to make a new binary that will hopefully be successfully imported by the app. Using --decode_raw, I can see that my binary file has the following format:


    1 {
      1: "ThisItem:name1"
      2 {
        1: "name1"
        2: <string>
        4: <string>
        5: 1
      }
    }
    
    1 {
      1: "ThisItem:name2"
      2 {
        1: "name2"
        2: <string>
        4: <string>
        5: 1
      }
    }
    
    1 {
      1: "ThatItem:name1"
      2 {
        1: "name1"
        3: <string>
        5: <data structure>
        8: <string>
      }
    }
    
    1 {
      1: "ThatItem:name2"
      2 {
        1: "name2"
        3: <string>
        5: <data structure>
        8: <string>
      }
    }
    
    1 {
      1: "ThisItem:name3"
      2 {
        1: "name3"
        2: <string>
        4: <string>
        5: 1
      }
    }

So I see several characteristics of the data structure:

The file is a concatenation of many records, each with the same field number 1. Therefore they need to use the same message format.
Each record has a sub-message in field number 2, but those messages have two different formats.
Each record has a field number 1 which is a prefix "ThisItem" or "ThatItem" which apparently identifies the message type for field 2 and a suffix that matches the first string in the "inner" message.

I can then make a .proto file to almost support this structure:


    syntax = "proto3";
    
    message RecordList {
        repeated Record records = 1;
    }
    
    message Record {
        string id = 1;
        ThisItem item = 2;
        ThatItem item = 2;  // Problem here: each record uses field 2, but with different message types.
        // Each record has either a ThisItem or ThatItem. Parsing the id field could tell which,
        // but that doesn't appear possible with protoc on the command line.
    }
    
    message ThisItem {
        string id = 1;
        string <element2> = 2;
        string <element4> = 4;
        int32 <element5> = 5;
    }
    
    message ThatItem {
        string id = 1;
        string <element3> = 3;
        <message type> <element5> = 5;
        string <element8> = 8;
    }

So I'm not sure if there is a way to decode/encode this binary on the command line. Is there some syntax I can use for the Record message to switch between the two possible choices for field 2 by parsing the string in field 1? If not, I will need to read and parse the records in a program, which is what I wanted to avoid.

One other possibility that I've realized: Instead of two different sub-messages ThisItem and ThatItem, I could use one sub-message and skip unused fields. The sub-message would populate fields 1, 2, 4 and 5 in one case, and fields 1, 3, 5 and 8 in the other case. The difficulty is field 5, which is the integer 1 in one case and a data structure in the other case. I'm not sure how to manage that. Is the integer 1 the binary encoding of an empty message?

Thanks for any help.

score 2 · Answer 1 · answered May 22 '21 at 20:06

Let me see if I can help answer this question.

1. Reencode Using protoc

protoc --decode_raw is a dead end for re-encoding: protoc has no --encode_raw counterpart, and you cannot write a .proto file whose messages and fields are simply named 1, 2, 3, and so on. However, if you can set up a schema that matches the data, you can use protoc to encode or decode easily.
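
To illustrate why the raw decode has to guess, here is a minimal Python sketch, assuming only the documented protobuf wire format (it is not part of protoc). The names read_varint and scan_fields and the sample bytes are hypothetical. For each value the binary stores only a field number and a wire type, so field names and the distinction between string, bytes, and nested message are not in the file.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_fields(buf):
    """Yield (field_number, wire_type, raw_value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint: int32, int64, bool, enum, ...
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited: string, bytes, or message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Hand-built sample: field 1 = "name1" (length-delimited), field 5 = 1 (varint).
sample = b"\x0a\x05name1\x28\x01"
fields = list(scan_fields(sample))
print(fields)
```

For the sample bytes this prints [(1, 2, b'name1'), (5, 0, 1)]: numbers and wire types only, which is the same information --decode_raw works from.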

Example Text File with RecordList data to encode

This is the text used to encode a message. I saved it in a file named message for the example below.

    records {
      id: '1'
      item {
        id: '1.1'
        element2: 'e2'
        element4: 'e4'
        element5: 5
      }
    }
    records {
      id: '2'
      item {
        id: '2.1'
        element2: 'e2'
        element4: 'e4'
        element5: 5
      }
    }

Version Of Your Proto File I Tested With

I named this file test.proto in the example below.

    syntax = "proto3";

    message RecordList {
        repeated Record records = 1;
    }

    message Record {
        string id = 1;
        ThisItem item = 2;
    }

    message ThisItem {
        string id = 1;
        string element2 = 2;
        string element4 = 4;
        int32 element5 = 5;
    }

Let's do an example with encoding. I am assuming a Linux environment.

    # Encode message and then decode it with our schema
    protoc --encode="RecordList" --proto_path= ./test.proto < message | protoc --decode="RecordList" --proto_path= ./test.proto

Output:

    records {
      id: "1"
      item {
        id: "1.1"
        element2: "e2"
        element4: "e4"
        element5: 5
      }
    }
    records {
      id: "2"
      item {
        id: "2.1"
        element2: "e2"
        element4: "e4"
        element5: 5
      }
    }

At this point you could modify this output to have different values that you want and encode it. Using a hex dump tool like hd or xxd can be very useful as well!

    # Re-encode the decoded output after editing it to contain different values.
    # Let's say you saved the output to out.txt
    protoc --encode="RecordList" --proto_path= ./test.proto < out.txt

    # You can always decode your output to see the formatted, parsed version
    protoc --encode="RecordList" --proto_path= ./test.proto < out.txt | protoc --decode="RecordList" --proto_path= ./test.proto

2. Create a Schema and Use protoc --encode

I would like to see the raw output of your --decode_raw command; even seeing the payload as a hex string like 08-24-AF-22 would be helpful. I can't really suggest anything about the variance in the message types without seeing that.

For example, I know that negative numbers of type int32 will show up as large unsigned integers with --decode_raw. Without seeing the raw data more closely, I am not sure what may be going on in your case.
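
The negative int32 behaviour can be reproduced with a small Python sketch, assuming the documented varint encoding in which an int32 is sign-extended to 64 bits before being written. The helper names encode_varint, decode_varint, and int32_on_wire are hypothetical and are not part of any protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode a complete varint into an unsigned integer."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
    return result


def int32_on_wire(value):
    """Sign-extend to 64 bits, then varint-encode the resulting bit pattern."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


wire = int32_on_wire(-1)
as_unsigned = decode_varint(wire)  # what a schema-less decode sees
# Reinterpret the 64-bit pattern as signed to recover the original value.
as_signed = as_unsigned - (1 << 64) if as_unsigned >= (1 << 63) else as_unsigned
print(len(wire), as_unsigned, as_signed)
```

This prints 10 18446744073709551615 -1, so a very large value in the raw output may simply be a small negative number once the field is declared as int32 in the schema.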

But yes, I would recommend trying to combine the ThisItem and ThatItem messages. It is very common for a message not to populate every one of its fields.
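
On the field 5 question from the post: the integer 1 is not the encoding of an empty message. A minimal Python sketch, assuming the documented wire format (the helper name tag is hypothetical), shows the bytes each case would produce for field number 5.

```python
VARINT = 0            # wire type for int32, int64, bool, enum, ...
LENGTH_DELIMITED = 2  # wire type for string, bytes, and embedded messages


def tag(field_number, wire_type):
    """Build the one-byte tag; valid for field numbers below 16."""
    return bytes([(field_number << 3) | wire_type])


# Field 5 holding the integer 1: tag, then the varint value.
int_one = tag(5, VARINT) + b"\x01"

# Field 5 holding an empty embedded message: tag, then a length of zero.
empty_message = tag(5, LENGTH_DELIMITED) + b"\x00"

print(int_one.hex(), empty_message.hex())
```

This prints 2801 2a00. Because the two uses of field 5 have different wire types, a single combined message cannot declare field 5 with one type that covers both, which is the main obstacle to merging ThisItem and ThatItem as they appear in the raw output.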

If you have not found this online decoder by Marc Gravell then be sure to try it. It is essentially protoc --decode_raw with more detail.

Thanks for confirming that re-encoding without a schema was not possible. I came to the conclusion that the only way to attack the encoding problem was programmatically. More info below. — JustinB, May 24 '21 at 02:58

score 1 · Answer 2 · answered May 24 '21 at 03:45

Since it looked like protoc on the command line couldn't do what I wanted, I turned to writing a program. The easiest path for me was to install Python, since the learning curve didn't look too steep and I could build a script bit by bit. The key for the data structure turned out to be replacing this part of my hypothetical .proto file:

    message Record {
        string id = 1;
        ThisItem item = 2;
        ThatItem item = 2;  // Problem here: each record uses field 2, but with different message types.
        // Each record has either a ThisItem or ThatItem. Parsing the id field could tell which,
        // but that doesn't appear possible with protoc on the command line.
    }

with a generalized form:

    message Record {
        optional string id = 1;
        oneof datafields {
            bytes data = 2;
            ThisItem thisitem = 3;
            ThatItem thatitem = 4;
        }
    }

The protobuf binary only uses the general bytes data structure, which is why protoc with --decode_raw shows all the data using the field number of 2. The data field can then be a container for ThisItem or ThatItem as necessary. Those two structures are also included as possible datafields so that the program record structure can accommodate them for programmatic manipulation.
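
One way to see why --decode_raw can display a bytes field as a nested message: on the wire, a bytes field and an embedded message both use the length-delimited wire type. The following is a hand-rolled Python sketch under that assumption, not the protobuf library; the helper names are hypothetical and the sample values are taken from the raw output in the question.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Tag with wire type 2, then the payload length, then the payload."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def encode_bytes_field(field_number, raw):
    """A `bytes` field: the payload is whatever raw bytes it holds."""
    return length_delimited(field_number, raw)


def encode_message_field(field_number, string_fields):
    """An embedded message: the payload is its serialized sub-fields."""
    body = b"".join(
        length_delimited(number, text.encode("utf-8"))
        for number, text in string_fields
    )
    return length_delimited(field_number, body)


# Inner message containing only field 1 = "name1".
inner_serialized = length_delimited(1, b"name1")

as_bytes = encode_bytes_field(2, inner_serialized)    # like `bytes data = 2`
as_message = encode_message_field(2, [(1, "name1")])  # like `ThisItem item = 2`

print(as_bytes.hex())
print(as_bytes == as_message)
```

This prints 12070a056e616d6531 followed by True. A schema-less decoder cannot tell the two declarations apart, so it shows the payload as a nested message whenever the bytes happen to parse as one.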

Here is sample code for Python, where the .proto file is myschema.proto, defined as shown in my question above with the generalized Record message:

    import myschema_pb2  # generated with: protoc --python_out=. myschema.proto
    from google.protobuf import text_format  # e.g. text_format.MessageToString(mylist)

    ### Read objects from PB and load into RecordList
    mylist = myschema_pb2.RecordList()
    with open('objects.pb', 'rb') as f:
        mylist.ParseFromString(f.read())

    ### Parse general data into ThisItem or ThatItem
    for rec in mylist.records:
        # The id looks like "ThisItem:name1"; the part before ':' names the type
        item_id = rec.id.partition(':')[0]
        if item_id == 'ThisItem':
            # Parses data into thisitem; data is cleared because both share the oneof
            rec.thisitem.ParseFromString(rec.data)
        elif item_id == 'ThatItem':
            # Parses data into thatitem; data is cleared because both share the oneof
            rec.thatitem.ParseFromString(rec.data)
        else:
            print('unknown record type:', rec.id)

Thisitem and thatitem can then be manipulated as needed. When it's time to write the protobuf file they are converted back into the general data format:

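    ### Generalize ThisItem and ThatItem into data
    # newlist is assumed to be the RecordList holding the edited records
    for rec in newlist.records:
        item_id = rec.id.partition(':')[0]
        if item_id == 'ThisItem':
            # Assigning data clears thisitem, because both share the oneof
            rec.data = rec.thisitem.SerializeToString()
        elif item_id == 'ThatItem':
            # Assigning data clears thatitem, because both share the oneof
            rec.data = rec.thatitem.SerializeToString()
        else:
            print('unknown record type:', rec.id)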

Note again, this structure is peculiar to the protobuf file I've been working with. I'm not sure why the developer decided to do it like this, rather than write thisitem and thatitem to the binary. As far as I know, all it changes is the field number, 2, 3 or 4.