I'm working on a parser to receive UDP information, parse it, and store it. To do so I'm using a BinaryReader since it will mostly be binary information. Some of it will be strings though. MSDN says for the ReadString() function:

Reads a string from the current stream. The string is prefixed with the length, encoded as an integer seven bits at a time.

And I completely understand it up until "seven bits at a time" which I tried to simply ignore until I started testing. I'm creating my own byte array before putting it into a MemoryStream and attempting to read it with a BinaryReader. Here's what I first thought would work:

byte[] data = new byte[] { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t', };
BinaryReader reader = new BinaryReader(new MemoryStream(data));
String str = reader.ReadString();

Knowing an int is 4 bytes (and toying around long enough to find out that BinaryReader is Little Endian) I pass it the length of 3 and the corresponding letters. However str ends up holding \0\0\0. If I remove the 3 zeros and just have

byte[] data = new byte[] { 3, (byte)'C', (byte)'a', (byte)'t', };

Then it reads and stores Cat properly. To me this conflicts with the documentation saying that the length is supposed to be an integer. Now I'm beginning to think they simply mean a number with no decimal place and not the data type int. Does this mean that a BinaryReader can never read a string larger than 127 characters (since that would be 01111111 corresponding to the 7 bits part of the documentation)?

I'm writing up a protocol and need to completely understand what I'm getting into before I pass our documentation along to our clients.

Corey Ogburn
  • BinaryReader is designed to read things in that were written out with BinaryWriter. So try writing out different length strings with a BinaryWriter and you should be able to figure out the protocol. – President James K. Polk Oct 31 '13 at 15:36
  • But you'd better find out how that UDP protocol sends you the data. If it doesn't length-prefix the string (and that's the most likely case), this is all in vain. – H H Oct 31 '13 at 15:59
  • http://msdn.microsoft.com/en-us/library/dd946975%28v=office.12%29.aspx – Ralf Oct 31 '13 at 16:00
  • I'm defining the protocol for my work, and the code that's sending the data will most likely not be written in C# (probably Python or C on Linux) and thus won't have access to BinaryWriter. I'm using BinaryReader for code readability, although I may ditch `ReadString` and use a solid 4 bytes for length and use `ReadChars` so it's easier to implement. – Corey Ogburn Oct 31 '13 at 16:04
  • I was wrong about the Encoding, it uses the "Writers current encoding". – H H Oct 31 '13 at 16:07
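The fixed-width alternative mentioned in the comments could look roughly like the sketch below. It assumes .NET 5 or later (C# 9 top-level statements) and a wire layout the thread never pins down: a 4-byte little-endian length counted in bytes, followed by UTF-8 text. `ReadLengthPrefixedString` is a hypothetical helper name, not a framework API.

```csharp
using System;
using System.IO;
using System.Text;

// Assumed wire layout (not specified in the thread): a 4-byte little-endian
// length in bytes, followed by that many bytes of UTF-8 text.
static string ReadLengthPrefixedString(BinaryReader source)
{
    int byteCount = source.ReadInt32();        // BinaryReader reads little endian
    byte[] raw = source.ReadBytes(byteCount);  // may return fewer bytes at end of stream
    if (raw.Length != byteCount)
        throw new EndOfStreamException("Packet ended before the string did.");
    return Encoding.UTF8.GetString(raw);
}

byte[] data = { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t' };
using var reader = new BinaryReader(new MemoryStream(data));
Console.WriteLine(ReadLengthPrefixedString(reader)); // Cat
```

The sketch uses `ReadBytes` rather than `ReadChars` because a byte count is unambiguous for the sender, whereas a character count depends on the encoding once the text contains multi-byte characters.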

2 Answers

I found the source code for BinaryReader. It uses a function called Read7BitEncodedInt() and after looking up that documentation and the documentation for Write7BitEncodedInt() I found this:

The integer of the value parameter is written out seven bits at a time, starting with the seven least-significant bits. The high bit of a byte indicates whether there are more bytes to be written after this one. If value will fit in seven bits, it takes only one byte of space. If value will not fit in seven bits, the high bit is set on the first byte and written out. value is then shifted by seven bits and the next byte is written. This process is repeated until the entire integer has been written.
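As a minimal sketch of the procedure described in that documentation, the loop below writes seven bits at a time and sets the high bit whenever more bytes follow. `Encode7Bit` is a hypothetical helper written for illustration, not the framework's own implementation, and the snippet assumes .NET 5 or later (C# 9 top-level statements).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a framework API): encodes a non-negative value
// seven bits at a time, least-significant group first.
static byte[] Encode7Bit(int value)
{
    var bytes = new List<byte>();
    uint v = (uint)value;                       // unsigned so the shift always terminates
    while (v >= 0x80)
    {
        bytes.Add((byte)((v & 0x7F) | 0x80));   // low 7 bits plus the "more bytes follow" flag
        v >>= 7;
    }
    bytes.Add((byte)v);                         // final byte, high bit clear
    return bytes.ToArray();
}

Console.WriteLine(BitConverter.ToString(Encode7Bit(3)));    // 03
Console.WriteLine(BitConverter.ToString(Encode7Bit(300)));  // AC-02
```

A length of 3 fits in a single byte, which is why the `{ 3, 'C', 'a', 't' }` array from the question reads back as Cat.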

Also, the link Ralf posted in the comments on the question illustrates what's going on more clearly.

Corey Ogburn
  • But what if the binary stream was written by a program running on a Big Endian machine and then read (for example if the file is transported) by another program on a Little Endian machine? I think they missed that. I am writing an application where I write the streams in Big Endian (Network Byte Order). – Lord of Scripts Apr 02 '16 at 00:27

Unless they specifically say 'int' or 'Int32', they just mean an integer as in a whole number.

By '7 bits at a time', they mean that it implements 7-bit length encoding, which seems a bit confusing at first but is actually rather straightforward. Here are some example values and how they are written out using 7-bit length encoding:

/*
decimal value   binary value                ->  enc byte 1   enc byte 2   enc byte 3
85              00000000 00000000 01010101  ->  01010101     n/a          n/a
1,365           00000000 00000101 01010101  ->  11010101     00001010     n/a
349,525         00000101 01010101 01010101  ->  11010101     10101010     00010101
*/

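The binary value column above is written most-significant byte first (big endian) for no other reason than I simply had to pick one and it's what I'm most familiar with. The 7-bit length encoding itself is little endian by its very nature: the least-significant seven bits are always written first.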

Note that 85 writes out to 1 byte, 1,365 writes out to 2 bytes, and 349,525 writes out to 3 bytes.
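One way to check the table against the framework itself is to let BinaryWriter write strings of exactly those lengths and look at whatever precedes the payload. This is a sketch under a few assumptions: .NET 5 or later (C# 9 top-level statements), UTF-8 encoding, and a filler character that occupies one byte. `PrefixBits` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Returns the length prefix that BinaryWriter emits for a string of `length`
// single-byte characters, rendered as space-separated binary octets.
static string PrefixBits(int length)
{
    using var stream = new MemoryStream();
    using (var writer = new BinaryWriter(stream, Encoding.UTF8))
    {
        writer.Write(new string('x', length)); // 'x' is one byte in UTF-8
    }
    byte[] all = stream.ToArray();             // ToArray still works on a closed MemoryStream
    int prefixSize = all.Length - length;      // whatever precedes the payload
    return string.Join(" ", all.Take(prefixSize)
        .Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
}

foreach (int n in new[] { 85, 1365, 349525 })
    Console.WriteLine($"{n} -> {PrefixBits(n)}");
// 85 -> 01010101
// 1365 -> 11010101 00001010
// 349525 -> 11010101 10101010 00010101
```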

Here's the same table using letters to show how each value's bits were used in the written output (dashes are zero-value bits, and the 0s and 1s are what's added by the encoding mechanism to indicate if a subsequent byte is to be written/read)...

/*
decimal value   binary value                ->  enc byte 1   enc byte 2   enc byte 3
85              -------- -------- -AAAAAAA  ->  0AAAAAAA     n/a          n/a
1,365           -------- -----BBB AAAAAAAA  ->  1AAAAAAA     0---BBBA     n/a
349,525         -----CCC BBBBBBBB AAAAAAAA  ->  1AAAAAAA     1BBBBBBA     0--CCCBB
*/
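Reading works the same way in reverse: strip the flag bit from each byte and shift the remaining seven bits into place. The sketch below is a hand-rolled illustration, assuming .NET 5 or later (C# 9 top-level statements); `Decode7Bit` is a hypothetical helper, not the method BinaryReader uses internally.

```csharp
using System;

// Hypothetical helper (not a framework API): rebuilds a value from its
// 7-bit-encoded bytes and reports how many bytes were consumed.
static int Decode7Bit(byte[] data, out int bytesRead)
{
    int result = 0;
    int shift = 0;
    bytesRead = 0;
    while (true)
    {
        if (bytesRead >= data.Length || shift > 28)
            throw new FormatException("Truncated or overlong 7-bit encoded integer.");
        byte b = data[bytesRead++];
        result |= (b & 0x7F) << shift;  // drop the flag bit, place 7 payload bits
        if ((b & 0x80) == 0)            // high bit clear: this was the last byte
            return result;
        shift += 7;
    }
}

// The three encoded bytes from the 349,525 row of the table.
int value = Decode7Bit(new byte[] { 0xD5, 0xAA, 0x15 }, out int used);
Console.WriteLine($"{value} ({used} bytes)"); // 349525 (3 bytes)
```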

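So values in the range of 0 to 2^7-1 (127) will write out as 1 byte, values of 2^7 (128) to 2^14-1 (16,383) will use 2 bytes, 2^14 (16,384) to 2^21-1 (2,097,151) will take 3 bytes, and so on, up to 5 bytes for the largest 32-bit length. A string read with ReadString() is therefore not limited to 127 characters.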
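To see that strings past the one-byte range work, here is a sketch that builds a packet by hand, the way a non-C# sender might, and reads it with BinaryReader. It assumes .NET 5 or later (C# 9 top-level statements) and UTF-8 text; the 200-byte payload is just an arbitrary example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// 200 is 11001000 in binary, so the prefix is 0xC8 (low 7 bits plus the
// continuation flag) followed by 0x01 (the remaining bits).
byte[] payload = Encoding.UTF8.GetBytes(new string('a', 200));
byte[] packet = new byte[] { 0xC8, 0x01 }.Concat(payload).ToArray();

using var reader = new BinaryReader(new MemoryStream(packet), Encoding.UTF8);
string text = reader.ReadString();

Console.WriteLine(text.Length);                                  // 200
Console.WriteLine(reader.BaseStream.Position == packet.Length);  // True
```

Note that the prefix counts encoded bytes, not characters, so a sender has to encode the text first and then measure it.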

dynamichael