I'm playing with an app and trying to reverse engineer the data files that it can export and import. The files are binary protobufs. My goal is to export a file, convert it to text, add more data records, re-encode it to binary, and reimport it, as a way to bypass tedious manual data entry in the app. I have used a protoc binary on my Windows machine with --decode_raw and can produce nicely readable hierarchical data without knowing the actual .proto schema used. Using Marc Gravell's parser gives similar results (with some ambiguities I don't quite understand). My questions are the following:
- Is there an easy way to re-encode the output of --decode_raw to produce the original binary, either using protoc or another tool? I understand that the raw decode is making assumptions about the unknown schema, and so far it looks like those assumptions work OK to make intelligible results. Is there a loss of data on the raw decode that would prevent re-encoding to the original? Is it just that the protoc developers didn't see a need to have this feature? With this capability, I could modify the text and re-encode, and have a decent chance of generating a valid binary.
- If the above is a no-go, given the raw output, how do I create a .proto file and a text message input file to re-encode the original binary using protoc --encode? I would appreciate a pointer to sample text files that could be used as command line input to protoc for me to play with to learn the needed syntax. The sample material I've seen all appears geared towards using protoc to generate source code. The binary protobufs I tested have decoded to strings, ints and a few hex values (which I still need to decipher) which correspond well to the data visible in the app, so I am confident that I can write the required schema if I see working examples.
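For reference, here is a minimal sketch of what a raw decode has to work with, assuming the standard protobuf wire format. Each field on the wire carries only a field number, a wire type, and a value; whether a length-delimited value is a string, raw bytes, or a sub-message, and whether a varint is an int32, a bool, or an enum, is known only to the schema. Python is used purely to illustrate the byte layout, and the helper names read_varint and read_fields and the sample bytes are made up for this example.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_fields(buf):
    """Return (field_number, wire_type, raw_value) tuples for one message level."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited: string, bytes, or sub-message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, wire_type, value))
    return fields


# Hypothetical sample: field 1 = "name1" (length-delimited), field 5 = 1 (varint).
sample = bytes([0x0A, 0x05]) + b"name1" + bytes([0x28, 0x01])
print(read_fields(sample))  # [(1, 2, b'name1'), (5, 0, 1)]
```

The length-delimited payload comes back as undifferentiated bytes, which is where a schema-less decoder has to guess.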
Some preferences: I'm tinkering on my phone and my Windows laptop, and would rather not need to install Python or another programming platform. I'd just like to use protoc on the command line, plus my text/hex editor.
Thanks for any help.
[Edit: I've located a web page that gives sample input, which gave me the clues I needed to make some progress. The page is https://medium.com/@at_ishikawa/cli-to-generate-protocol-buffers-c2cfdf633dce, so thanks to @at_ishikawa for taking the time to make it. With the examples, I understand how to format a message file to generate a binary. However, it looks like the binary I'm trying to decode may not be amenable to the command line. See my new question below.]
New question: I still have the goal of decoding the binary into a text message, editing the text message to add more data records, and re-encoding the modified text message to make a new binary that will hopefully be successfully imported by the app. Using --decode_raw, I can see that my binary file has the following format:
1 {
1: "ThisItem:name1"
2 {
1: "name1"
2: <string>
4: <string>
5: 1
}
}
1 {
1: "ThisItem:name2"
2 {
1: "name2"
2: <string>
4: <string>
5: 1
}
}
1 {
1: "ThatItem:name1"
2 {
1: "name1"
3: <string>
5: <data structure>
8: <string>
}
}
1 {
1: "ThatItem:name2"
2 {
1: "name2"
3: <string>
5: <data structure>
8: <string>
}
}
1 {
1: "ThisItem:name3"
2 {
1: "name3"
2: <string>
4: <string>
5: 1
}
}
So I see several characteristics of the data structure:
- The file is a concatenation of many records, each with the same field number 1. Therefore they need to use the same message format.
- Each record has a sub-message in field number 2, but those messages have two different formats.
- Each record has a string in field number 1 made up of a prefix, "ThisItem" or "ThatItem", which apparently identifies the message type used in field 2, and a suffix that matches the first string in the "inner" message.
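To make the observed layout concrete, here is a minimal sketch that hand-assembles one "ThisItem" record at the byte level, assuming the standard protobuf wire format. The helper names and the placeholder strings "aaa" and "bbb" are invented for illustration; the real string contents are not shown in the dump above. Python is used only to show the bytes.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def length_delimited(number, payload):
    """Field with wire type 2: key, payload length, then the payload bytes."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(number, value):
    """Field with wire type 0: key, then the value as a varint."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


# Inner message (field 2 of the record), with placeholder strings.
inner = (length_delimited(1, b"name3")
         + length_delimited(2, b"aaa")
         + length_delimited(4, b"bbb")
         + varint_field(5, 1))

# One record: id string in field 1, inner message in field 2.
record = length_delimited(1, b"ThisItem:name3") + length_delimited(2, inner)

# One top-level entry: the record wrapped in field number 1.
new_entry = length_delimited(1, record)
print(new_entry.hex())
```

Because the file is a plain concatenation of field-1 entries, bytes built this way could in principle be appended to an existing export to add a record. Whether the app accepts the result is a separate question.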
I can then make a .proto file to almost support this structure:
syntax = "proto3";
message RecordList {
repeated Record records = 1;
}
message Record {
string id = 1;
ThisItem item = 2;
ThatItem item = 2; // Problem here, each record uses field 2, but with different message types.
// Each record has either a ThisItem or ThatItem. Parsing the id field could tell which,
// but that doesn't appear possible with protoc on the command line.
}
message ThisItem {
string id = 1;
string <element2> = 2;
string <element4> = 4;
int32 <element5> = 5;
}
message ThatItem {
string id = 1;
string <element3> = 3;
<message type> <element5> = 5;
string <element8> = 8;
}
So I'm not sure if there is a way to decode/encode this binary on the command line. Is there some syntax I can use for the Record message to switch between the two possible choices for field 2 by parsing the string in field 1? If not, I will need to read and parse the records in a program, which is what I wanted to avoid.
One other possibility that I've realized: instead of two different sub-messages ThisItem and ThatItem, I could use one sub-message and skip unused fields. The sub-message would populate fields 1, 2, 4 and 5 in one case, and fields 1, 3, 5 and 8 in the other case. The difficulty is field 5, which is the integer 1 in one case and a data structure in the other case. I'm not sure how to manage that. Is the integer 1 the binary encoding of an empty message?
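For comparison, here is a small sketch of how the two readings of field 5 would differ on the wire, assuming the standard protobuf encoding in which each field key is (field_number << 3) | wire_type. The helper name key is invented for this example, and Python is used only to show the bytes.

```python
VARINT = 0            # wire type for int32, bool, enum, ...
LENGTH_DELIMITED = 2  # wire type for string, bytes, sub-message


def key(number, wire_type):
    """Single-byte field key; valid only while the result is below 128."""
    return bytes([(number << 3) | wire_type])


# Field 5 holding the integer 1.
int_one = key(5, VARINT) + bytes([1])

# Field 5 holding an empty sub-message: a length-delimited field of length 0.
empty_message = key(5, LENGTH_DELIMITED) + bytes([0])

print(int_one.hex(), empty_message.hex())  # 2801 2a00
```

Under that assumption the two cases use different wire types, so a hex editor should show which one a given record contains: 28 01 for the integer and 2a followed by a length byte for a sub-message.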
Thanks for any help.