0

I'm working on writing a pure JS thrift decoder that doesn't depend on thrift definitions. I have been following this handy guide which has been my bible for the past few days: https://erikvanoosten.github.io/thrift-missing-specification/

I almost have my parser working, but there is a string type that throws a wrench into the program, and I don't quite understand what it's doing. Here is an excerpt of the hexdump, which I did my best to annotate:

Correctly parsing:

000001a0  0a 32 30 32 31 2d 31 31  2d 32 34 16 02 00 18 07  |.2021-11-24.....|
........................blah blah blah............|  |  |
                                       Object End-|  |  |
                           0x18 & 0xF = 0x8 = Binary-|  |
             The binary sequence is 0x7 characters long-|
000001b0  53 65 61 74 74 6c 65 18  02 55 53 18 02 55 53 18  |Seattle..US..US.|
          S  E  A  T  T  L  E  |___|  U  S  |___| U  S
    Another string, 2 bytes long |------------|

So far so good.

But then I get to this point. The string I am trying to extract is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4592.0 Safari/537.36 Edg/94.0.975.1" and is 134 bytes long.

000001c0  09 54 61 68 6f 65 2c 20  43 41 12 12 00 00 08 c8  |.Tahoe, CA......|
                                 Object ends here-|  |  |
                           0x8 & 0xF = 0x8 = Binary -|  |
                                  0xc8 bytes long (200)-|
000001d0  01 86 01 4d 6f 7a 69 6c  6c 61 2f 35 2e 30 20 28  |...Mozilla/5.0 (|
          |  |  |  M  o  z  i  l   l  a  
        ???? |--|-134, encoded as var-int
000001e0  4d 61 63 69 6e 74 6f 73  68 3b 20 49 6e 74 65 6c  |Macintosh; Intel|

As you can see, I have a byte sequence 0x08 0xC8 0x01 0x86 0x01. It contains the length of the string I'm looking for and is followed by that string, but it also has 3 extra bytes whose purpose is unclear.

The 0x01 is especially confusing, as it is neither a type identifier nor does it seem to have a concrete value.

What am I missing?

Slava Knyazev

2 Answers

0

Thrift supports pluggable serialization schemes. In tree you have binary, compact and json. Out of tree anything goes. From the looks of it you are trying to decode compact protocol, so I'll answer accordingly.

Everything sent and everything returned in a Thrift RPC call is packaged in a struct. Every field in a struct has a 1-byte type and a 2-byte field ID prefix. In compact protocol, field IDs are delta-encoded into the type byte when possible, and all ints are compressed down to just the bits needed to store them (plus some flags). Because ints can now take up varying numbers of bytes, we need to know when they end. Compact protocol stores the int bits in the low 7 bits of each byte and sets the high-order bit to 1 if the next byte continues the int. If the high-order bit is 0, the int is complete. Thus the int 5 (101) would be encoded in one byte as 00000101. Compact knows this is the end of the int because the high-order bit is 0.

In your case, the int 134 (binary 10000110) needs 2 bytes because it is more than 7 bits long. The low 7 bits (0000110) are stored in byte 1 with the 0x80 bit set to flag "the int continues", giving 0x86. The second and final byte encodes the remaining bit (00000001). What you thought was 134 was just the encoding of the first seven bits. The stray 1 was the final bit of the 134.
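As a rough illustration of the continuation-bit rule described above, here is a minimal sketch of an unsigned varint reader. The name `readVarint` and its `{ value, next }` return shape are made up for this example and are not part of the Thrift library. It uses multiplication rather than bit shifts, so values above 32 bits do not wrap.

```javascript
// Hypothetical helper (not from the Apache Thrift library):
// decode an unsigned varint that starts at `offset` in `bytes`.
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const byte = bytes[pos++];
    // The low 7 bits carry data, least significant group first.
    value += (byte & 0x7f) * 2 ** shift;
    // A clear high bit marks the last byte of the int.
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value, next: pos };
}

// 0x86 0x01 is the length prefix from the question; 0x4d is the 'M' of "Mozilla".
const lengthField = readVarint(Uint8Array.from([0x86, 0x01, 0x4d]), 0);
console.log(lengthField.value, lengthField.next); // 134 2
```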

I'd recommend you use the in tree source to do any needed protocol encoding/decoding. It's already written and tested: https://github.com/apache/thrift/blob/master/lib/nodejs/lib/thrift/compact_protocol.js

codeSF
  • Alright I see. So that explains the `0x86 0x01` portion as `(0x86 & 0x7F) + (0x01 << 7) = 134`. However, this does not explain the preceding `0x08 0xC8 0x01` bytes which to my understanding still means "Binary sequence, 200 bytes ahead" – Slava Knyazev Oct 21 '21 at 17:46
  • It's hard to tell exactly what is going on because I don't know what the original message is, nor do I have the complete encoded message. My guess, as stated in the answer, is that the 0x08 is the type and the 0xc8 0x01 is the field ID (not the length). Take a look at the link. It's the JavaScript code used to do this exact encoding (you can use it and it will work, or you can read it and understand all of the subtleties, e.g. the delta encoding I mentioned). – codeSF Oct 22 '21 at 20:11
  • I thought the field id was limited to 4 bits. But it looks like it's either 4 bits, as part of the type identifier, or, if that's 0, then it's the next 16 bits. This makes sense. – Slava Knyazev Oct 22 '21 at 20:37
  • The field id is delta encoded when possible (as mentioned in the answer above), and if it cannot be packed with the type it is saved directly and compressed (like all other ints, thus not necessarily 2 bytes as mentioned in the answer above). It is enticing to think there's a shortcut that will handle decoding, but the code in the link I provided is as short as it's going to get. If you want to reliably decode Thrift compact protocol you should use that code or understand it. – codeSF Oct 24 '21 at 14:37
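To make the last comment concrete, here is a hedged sketch of reading one compact-protocol field header. It assumes the layout in the specification linked in the question: a delta of 1 to 15 is packed into the high nibble of the type byte, and otherwise the field id follows as a zigzag-encoded varint. The names `decodeVarint`, `unzigzag` and `readFieldHeader` are invented for this example. It does not cover everything a full decoder needs (for instance, bool fields carry their value in the type nibble).

```javascript
// Hypothetical helpers, not taken from the Apache Thrift library.

// Unsigned varint: 7 data bits per byte, high bit set means "more bytes follow".
function decodeVarint(bytes, pos) {
  let value = 0;
  let shift = 0;
  let byte;
  do {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    byte = bytes[pos++];
    value += (byte & 0x7f) * 2 ** shift;
    shift += 7;
  } while (byte & 0x80);
  return { value, next: pos };
}

// Zigzag maps the encoded values 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
function unzigzag(n) {
  return n % 2 === 0 ? n / 2 : -(n + 1) / 2;
}

// `lastFieldId` is the id of the previous field in the same struct (0 at the start).
function readFieldHeader(bytes, pos, lastFieldId) {
  const head = bytes[pos++];
  if (head === 0x00) return { stop: true, next: pos }; // end of struct
  const type = head & 0x0f; // low nibble: type (8 = binary/string)
  const delta = head >> 4;  // high nibble: field id delta, 0 = id follows
  if (delta !== 0) {
    return { stop: false, type, fieldId: lastFieldId + delta, next: pos };
  }
  const id = decodeVarint(bytes, pos);
  return { stop: false, type, fieldId: unzigzag(id.value), next: id.next };
}

// The five bytes from the question.
const sample = Uint8Array.from([0x08, 0xc8, 0x01, 0x86, 0x01]);
const header = readFieldHeader(sample, 0, 0);
const strLen = decodeVarint(sample, header.next); // lengths are plain varints, no zigzag
console.log(header.type, header.fieldId, strLen.value); // 8 100 134
```

Under these assumptions, `0xC8 0x01` is the varint 200, which zigzag-decodes to field id 100, and `0x86 0x01` is the 134-byte string length.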
-2

The byte sequence reads as follows

  • 0x08: String type, the next 2 bytes define the elementId
  • 0xC8 0x01: ElementId, encoded in 16 bits
  • 0x86 0x01: String length, encoded as var int

It turns out that if the type identifier does not contain bits defining the elementId, the elementId will be stored in the next 2 bytes.

Slava Knyazev
  • That is an incomplete description at best. What happens when you hit a bool in your decoder? What happens when you hit a negative number? What happens when the id can be encoded in 1 byte? – codeSF Oct 24 '21 at 14:39
  • @codeSF The question wasn't about how to decode bools or negative numbers, it was only about the `0x08 0xC8 0x01 0x86 0x01` sequence. I had everything else figured out. – Slava Knyazev Oct 24 '21 at 16:51