
I have an issue with writing the character 56623 to a stream using a StreamWriter with UTF-16 encoding (the issue persists in other encodings as well). If I get the buffer from the stream, it contains the value 65533 instead of what I originally wrote. This issue snuck up on me during randomised unit tests: it does not appear for the values 60000 or 95.

To illustrate, I have a minimal program to check the behaviour:

   char value = (char)56623;
   MemoryStream stream = new MemoryStream();
   StreamWriter writer = new StreamWriter(stream, Encoding.Unicode);
   writer.Write(value);
   writer.Close();

   var byteArray = BitConverter.GetBytes(value); // Reference bytes
   var buffer = stream.GetBuffer(); // GetBuffer is on MemoryStream, not StreamWriter

By reading byteArray and buffer I get (bytes in little-endian order):

   byteArray = [47, 221]                 // 0xDD2F = 56623
   buffer    = [255, 254, 253, 255, ...] // BOM (0xFEFF), then 0xFFFD = 65533

Thus, the written value 65533 is clearly not equal to the original 56623. However, when trying with the value 60000 the correct bytes are written:

   byteArray = [96, 234]                 // 0xEA60 = 60000
   buffer    = [255, 254, 96, 234, ...]  // BOM, then 0xEA60 = 60000

I fail to understand this behaviour, but I am unwilling to believe that there is an issue with the implementation of StreamWriter, so there has to be something I am missing.

What is it that I am not seeing here?

Thank you!

Robert Kaufmann

1 Answer


The problem is that 56623 is U+DD2F - which is a low surrogate UTF-16 code unit (the range U+DC00 to U+DFFF). It's invalid on its own - it's only valid as the second half of a surrogate pair used to encode code points which aren't in the Basic Multilingual Plane. When the encoder meets a lone surrogate, it substitutes the replacement character U+FFFD - which is exactly the 65533 you're seeing in the buffer.
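To make this concrete, here is a minimal sketch (not from the original question; the values are illustrative) showing both that 56623 is classified as a low surrogate and that the encoder replaces it with U+FFFD:

```csharp
using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        char value = (char)56623; // 0xDD2F

        // DC00-DFFF is the low surrogate range, so this prints True.
        Console.WriteLine(char.IsLowSurrogate(value));

        // A lone surrogate is not valid UTF-16, so the default encoder
        // fallback substitutes the replacement character U+FFFD (65533).
        byte[] encoded = Encoding.Unicode.GetBytes(value.ToString());
        Console.WriteLine(BitConverter.ToString(encoded)); // FD-FF (0xFFFD, little-endian)
    }
}
```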

It should be fine if you write it as part of a valid surrogate pair (i.e. preceded by a high surrogate) - but if you're trying to write it on its own, that suggests you've got invalid data to start with. You shouldn't take random UTF-16 code units and expect them to be valid Unicode code points. You may be okay if you explicitly exclude U+D800 to U+DFFF inclusive, but even then you've got odd characters like a BOM which shouldn't occur within normal text.
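As a sketch of the valid-pair case (the code point U+1D11E is my own example, not from the question), a character outside the BMP round-trips through the same writer without any replacement:

```csharp
using System;
using System.IO;
using System.Text;

class PairDemo
{
    static void Main()
    {
        // U+1D11E (musical G clef) needs a surrogate pair in UTF-16:
        // high surrogate 0xD834 followed by low surrogate 0xDD1E.
        string s = char.ConvertFromUtf32(0x1D11E);

        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, Encoding.Unicode))
            writer.Write(s);

        // BOM (FF FE), then 34 D8 1E DD: the pair survives intact,
        // unlike a lone surrogate which becomes U+FFFD.
        Console.WriteLine(BitConverter.ToString(stream.ToArray()));
    }
}
```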

Jon Skeet
  • Thank you! Thus it was clearly an issue with my test selection. Do you happen to have any tips on how one could produce valid random Unicode characters (and strings), or is it perhaps better to skip the randomisation in my tests (cost/value-wise)? – Robert Kaufmann Jun 06 '14 at 13:42
  • 1
    @Vectovox: I would consider picking ranges of reasonable characters - you might want to use http://www.unicode.org/charts/ to think about what ranges are sensible. – Jon Skeet Jun 06 '14 at 13:43
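A generator along the lines of that comment might look like the following sketch (the range 0x20–0xFFFD is my own illustrative choice; for serious tests, pick ranges from the Unicode charts as suggested above):

```csharp
using System;
using System.Linq;

static class RandomTextDemo
{
    static readonly Random Rng = new Random();

    // Sketch: pick random BMP scalar values, rejecting the surrogate
    // block U+D800-U+DFFF, which can never appear on its own.
    public static string RandomString(int length) =>
        string.Concat(Enumerable.Range(0, length).Select(_ =>
        {
            int cp;
            do { cp = Rng.Next(0x20, 0xFFFD); }   // illustrative printable-ish range
            while (cp >= 0xD800 && cp <= 0xDFFF); // reject lone surrogates
            return char.ConvertFromUtf32(cp);     // throws on surrogates, hence the loop
        }));
}
```

Every generated character then survives encoding unchanged, since no lone surrogates can occur.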