The snippet says it all :-)

UTF8Encoding enc = new UTF8Encoding(true/*include Byte Order Mark*/);
byte[] data = enc.GetBytes("a");
// data has length 1.
// I expected the BOM to be included. What's up?
xyz
  • As said below, the BOM isn't necessary for UTF8. – jalf Jan 09 '09 at 14:13
  • Saying "the BOM isn't necessary for UTF-8" is simply inaccurate. The preamble is how applications distinguish between UTF8 and codepaged ANSI. – EricLaw Feb 12 '14 at 21:13

4 Answers

You wouldn't want the BOM to be emitted on every call to GetBytes; otherwise you'd have no way of (say) writing a file a line at a time without a BOM landing in front of every line.

By exposing it with GetPreamble, callers can insert the preamble just at the appropriate point (i.e. at the start of their data). I agree that the documentation could be a lot clearer though.
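To illustrate the line-at-a-time case, here is a minimal sketch. The class `PreambleDemo` and the helper `WriteLines` are hypothetical names made up for this example; only `GetPreamble`, `GetBytes` and `MemoryStream` are real framework members. It writes the preamble once, then the bytes of each line.

```csharp
using System;
using System.IO;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper (not a framework API): emits the encoding's
    // preamble exactly once, followed by the bytes of every line.
    public static byte[] WriteLines(Encoding enc, params string[] lines)
    {
        using (var stream = new MemoryStream())
        {
            // The BOM comes from GetPreamble(), never from GetBytes().
            byte[] preamble = enc.GetPreamble();
            stream.Write(preamble, 0, preamble.Length);

            foreach (string line in lines)
            {
                // GetBytes returns only the encoded characters.
                byte[] data = enc.GetBytes(line + "\n");
                stream.Write(data, 0, data.Length);
            }
            return stream.ToArray();
        }
    }
}
```

Calling `PreambleDemo.WriteLines(new UTF8Encoding(true), "a", "b")` should give the three BOM bytes followed by two bytes per line, with no BOM in front of the second line.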

Jon Skeet
  • In general, you should be able to ignore the preamble, since your writer will insert it based on your encoding choice. – Ishmael Jan 23 '09 at 19:17
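As a rough illustration of that comment, a `StreamWriter` constructed with a BOM-emitting encoding writes the preamble itself when it starts at the beginning of a stream. The class `WriterDemo` and method `WriteWithWriter` below are hypothetical names used only for this sketch.

```csharp
using System;
using System.IO;
using System.Text;

static class WriterDemo
{
    // Hypothetical helper: lets StreamWriter handle the preamble
    // instead of calling GetPreamble() by hand.
    public static byte[] WriteWithWriter(string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
        {
            writer.Write(text);
        } // disposing the writer flushes it (and closes the stream)

        // MemoryStream.ToArray() is still valid after the stream is closed.
        return stream.ToArray();
    }
}
```

With this approach, `WriterDemo.WriteWithWriter("a")` should produce the BOM followed by the single byte for "a", without any explicit preamble handling in the caller.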

Thank you both. The following works, and LINQ makes the combination simple :-)

UTF8Encoding enc = new UTF8Encoding(true);
byte[] data = enc.GetBytes("a");
byte[] combo = enc.GetPreamble().Concat(data).ToArray();
xyz
  • This is exactly what I'm doing. Note that `Encoding.UTF8` is a shorthand for `new UTF8Encoding(true)`, so your first line could be just `var enc = Encoding.UTF8;`, or in-line it to the other two, or even shrink the whole thing to a one-liner `var combo = Encoding.UTF8.GetPreamble().Concat(Encoding.UTF8.GetBytes("a")).ToArray();` Cheers. – Daniel Liuzzi Feb 25 '11 at 08:13

Because it is expected that GetBytes() will be called lots of times... you need to use:

byte[] preamble = enc.GetPreamble();

(only call it at the start of a sequence) and write that; this is where the BOM lives.

Marc Gravell

Note that in general, you don't need the Byte Order Mark for UTF-8 anyway. Its main purpose is to tell UTF-16 BE and UTF-16 LE apart. UTF-8 has no byte-order variants, so there is no such thing as UTF-8 LE or UTF-8 BE.
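For comparison, this small sketch prints the preamble bytes of the encodings mentioned above. The class `BomBytes` and method `PreambleHex` are hypothetical names for this example; `Encoding.Unicode` is .NET's UTF-16 LE and `Encoding.BigEndianUnicode` is UTF-16 BE.

```csharp
using System;
using System.Text;

static class BomBytes
{
    // Hypothetical helper: formats an encoding's preamble as hex,
    // e.g. "EF-BB-BF"; an encoding without a preamble gives "".
    public static string PreambleHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }
}
```

`BomBytes.PreambleHex(Encoding.Unicode)` gives `FF-FE` and `BomBytes.PreambleHex(Encoding.BigEndianUnicode)` gives `FE-FF`, so the two UTF-16 byte orders are distinguishable. `new UTF8Encoding(true)` gives the single fixed sequence `EF-BB-BF`, which marks the data as UTF-8 but carries no byte-order information.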

MSalters
  • It also allows you to differentiate UTF-8 files from ANSI files. – Ishmael Jan 23 '09 at 19:15
  • Even Microsoft admits "ANSI" is a confusing name - even when it's used to describe a charset. "ANSI files" don't exist anyway; on Windows all files are binary (Mainframes did have true text files, but they didn't have "Microsoft ANSI") – MSalters Feb 03 '09 at 14:34