
I'm parsing a file (which I don't generate) that contains a string. The string is always preceded by 2 bytes which tell me the length of the string that follows.

For example:

05 00 53 70 6F 72 74

would be:

Sport

Using a C# BinaryReader, I read the string using:

string s = new string(binaryReader.ReadChars(size));

Sometimes there's the odd funky character which seems to push the position of the stream on further than it should. For example:

0D 00 63 6F 6F 6B 20 E2 80 94 20 62 6F 6F 6B

Should be:

cook — book

and although it reads fine the stream ends up two bytes further along than it should?! (Which then messes up the rest of the parsing.)

I'm guessing it has something to do with the 0xE2 in the middle, but I'm not really sure why or how to deal with it.

Any suggestions greatly appreciated!

Bridgey
  • After all the great answers, I solved it with: `byte[] b = binRdr.ReadBytes(size); string s = Encoding.UTF8.GetString(b);` In fact, this is gonna improve my parser no end! Thanks all! – Bridgey May 11 '11 at 21:27

3 Answers


My guess is that the string is encoded in UTF-8. The 3-byte sequence E2 80 94 corresponds to the single Unicode character U+2014 (EM DASH).
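A quick way to check this guess is to decode just those three bytes. This is a minimal sketch, assuming C# 9 or later (top-level statements); the variable names are made up for illustration.

```csharp
using System;
using System.Text;

// The three bytes from the middle of the second example.
byte[] sequence = { 0xE2, 0x80, 0x94 };

// Decode them as UTF-8 (assumption: the file really is UTF-8).
string decoded = Encoding.UTF8.GetString(sequence);

Console.WriteLine(decoded.Length);            // 1
Console.WriteLine($"U+{(int)decoded[0]:X4}"); // U+2014
```

Three bytes in, one character out, which is where the byte count and the character count start to diverge.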

Ted Hopp
  • Hey Ted, thanks for the answer. The character I expected is indeed EM DASH so that'll be UTF8 then. So reading the string with UTF8 encoding should hopefully solve it... – Bridgey May 11 '11 at 21:19
  • What's the source? I wonder because it's not all that common to have the length at the beginning. – Jonas Elfström May 11 '11 at 21:37
  • @Jonas - Actually, it's common to serialize strings as length-plus-data (although it's more common, I think, that the count refers to the number of bytes in the encoding, not the number of logical characters). It avoids problems of detecting the end of the string (not all environments use a `0` byte for this). – Ted Hopp May 11 '11 at 21:48
  • You are absolutely correct, it's just that I can't remember ever having seen it in a file. – Jonas Elfström May 12 '11 at 04:46
  • Yes, for clarity, the 2 preceding bytes are an Int16 which represents the number of bytes that make up the string, not the number of chars. – Bridgey May 12 '11 at 10:26

In your first example

05 00 53 70 6F 72 74

none of the bytes are over 0x7F, which is the upper limit of 7-bit ASCII. UTF-8 retains compatibility with ASCII by encoding those values as single bytes and setting the 8th (most significant) bit on every byte of a multi-byte sequence.

0D 00 63 6F 6F 6B 20 E2 80 94 20 62 6F 6F 6B

Just as Ted noticed, your "problems" start with 0xE2, because that is not a 7-bit ASCII character.

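The length prefix 0D 00 is a little-endian 13, which is the number of bytes in the string, not the number of characters. The string has only 11 characters, because the EM DASH occupies three bytes. ReadChars(13) therefore keeps decoding until it has 13 characters, which consumes two bytes more than the string actually occupies.

0xE2 (binary 1110 0010) has its most significant bit set, so it is not ASCII; the 1110 prefix marks it as the first byte of a three-byte UTF-8 sequence, here E2 80 94, which represents — (EM DASH).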
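To make the byte/character difference concrete, here is a rough sketch that walks the 13 payload bytes and counts characters by looking only at each lead byte. `Utf8SequenceLength` is a hypothetical helper written for this example, not a framework method, and the code assumes C# 9 or later and well-formed UTF-8 input.

```csharp
using System;

// Hypothetical helper: how many bytes a UTF-8 sequence occupies,
// judged from its first byte. Returns 0 for a continuation byte
// (10xxxxxx) or an invalid lead byte.
static int Utf8SequenceLength(byte lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx (0xE2 lands here)
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// The 13 payload bytes from the question (length prefix 0D 00 omitted).
byte[] payload =
{
    0x63, 0x6F, 0x6F, 0x6B, 0x20, 0xE2, 0x80, 0x94, 0x20, 0x62, 0x6F, 0x6F, 0x6B
};

int codePoints = 0;
for (int i = 0; i < payload.Length; codePoints++)
{
    int length = Utf8SequenceLength(payload[i]);
    if (length == 0) throw new FormatException("Unexpected byte at offset " + i);
    i += length; // skip the whole sequence, continuation bytes included
}

Console.WriteLine(payload.Length); // 13
Console.WriteLine(codePoints);     // 11
```

In real code you would let `Encoding.UTF8` do the decoding; the loop is only here to show why 13 bytes yield 11 characters.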

Jonas Elfström
  • Awesome, thanks Jonas. Think you and Ted are both spot on. I'm still learning about Encoding but this is a lesson well learnt. – Bridgey May 11 '11 at 21:20

As you correctly stated, the E2 byte is the problem. BinaryReader.ReadChars(n) does not read n bytes but n characters, decoded with the reader's encoding (UTF-8 by default). See Wikipedia for Unicode encodings. The relevant concept is multi-byte UTF-8 sequences (not surrogate characters, which belong to UTF-16): in UTF-8, code points U+0080 to U+07FF take two bytes and U+0800 to U+FFFF take three, so the EM DASH (U+2014) occupies three bytes. This is the reason for your offset mismatch.

To fix the offset issue, read the raw bytes with BinaryReader.ReadBytes and then decode them with the correct Encoding instance. Assuming you are dealing with UTF-8, pass the byte array to

Encoding.UTF8.GetString(byte[] rawData)

to get your correctly encoded string back.
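Put together, a minimal sketch of that approach could look like the following. `ReadLengthPrefixedUtf8` is a hypothetical helper name, the length prefix is treated as an unsigned 16-bit value (an assumption; the file format itself is not documented here), and the extra 0xFF byte only stands in for whatever follows the string in the real file. It assumes C# 9 or later.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: reads a 2-byte little-endian byte count,
// then exactly that many bytes, and decodes them as UTF-8.
static string ReadLengthPrefixedUtf8(BinaryReader input)
{
    ushort byteCount = input.ReadUInt16();   // BinaryReader is little-endian
    byte[] raw = input.ReadBytes(byteCount); // returns fewer bytes if the stream ends early
    if (raw.Length != byteCount)
        throw new EndOfStreamException("String data is shorter than its length prefix.");
    return Encoding.UTF8.GetString(raw);
}

// Second example from the question, plus one stand-in byte for "the rest of the file".
byte[] data =
{
    0x0D, 0x00,
    0x63, 0x6F, 0x6F, 0x6B, 0x20, 0xE2, 0x80, 0x94, 0x20, 0x62, 0x6F, 0x6F, 0x6B,
    0xFF
};

using var stream = new MemoryStream(data);
using var reader = new BinaryReader(stream);

string s = ReadLengthPrefixedUtf8(reader);

Console.WriteLine(s.Length);        // 11 characters
Console.WriteLine(stream.Position); // 15 = 2 prefix bytes + 13 string bytes
```

Because the byte count drives the read, the stream position no longer depends on how many characters those bytes decode to.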

Yours, Alois Kraus

Alois Kraus
  • Thanks for the explanation Alois. With Ted and Jonas's help I'd got there, but your 'number of bytes' explanation definitely helps. – Bridgey May 11 '11 at 21:27
  • Surrogate characters don't really have anything to do with this. They are a UTF-16 technique for representing Unicode code points above U+FFFF (the supplementary planes) as a pair of 16-bit code units drawn from a reserved range in the BMP. – Ted Hopp May 12 '11 at 06:02