Why isn't the Byte Order Mark emitted from UTF8Encoding.GetBytes?

Question

The snippet says it all :-)

UTF8Encoding enc = new UTF8Encoding(true/*include Byte Order Mark*/);
byte[] data = enc.GetBytes("a");
// data has length 1.
// I expected the BOM to be included. What's up?

Saying "the BOM isn't necessary for UTF-8" is simply inaccurate. The preamble is how applications distinguish between UTF8 and codepaged ANSI. — EricLaw, Feb 12 '14 at 21:13

score 18 · Accepted Answer · answered Jan 07 '09 at 16:06

You wouldn't want the BOM emitted on every call to GetBytes; otherwise you'd have no way of (say) writing a file a line at a time without a BOM landing at the start of every line.

By exposing it with GetPreamble, callers can insert the preamble just at the appropriate point (i.e. at the start of their data). I agree that the documentation could be a lot clearer though.
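To illustrate the point, here is a minimal sketch of writing the preamble once and then encoding each line separately. The class and method names (`PreambleOnceDemo`, `EncodeLines`) are hypothetical and exist only for this example, and a `MemoryStream` stands in for a real file.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreambleOnceDemo
{
    // Hypothetical helper: encodes each line with its own GetBytes call,
    // but writes the BOM only once, at the very start of the output.
    public static byte[] EncodeLines(string[] lines)
    {
        var enc = new UTF8Encoding(true); // true: GetPreamble() returns the BOM
        using (var buffer = new MemoryStream())
        {
            byte[] preamble = enc.GetPreamble();
            buffer.Write(preamble, 0, preamble.Length);

            foreach (string line in lines)
            {
                byte[] chunk = enc.GetBytes(line + "\n"); // no BOM in here
                buffer.Write(chunk, 0, chunk.Length);
            }
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] data = EncodeLines(new[] { "a", "b" });
        Console.WriteLine(BitConverter.ToString(data)); // EF-BB-BF-61-0A-62-0A
    }
}
```

If GetBytes added the BOM itself, the output above would contain EF-BB-BF before every line rather than only at the start.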

Jon Skeet

In general, you should be able to ignore the preamble, since your writer will insert it based on your encoding choice. – Ishmael Jan 23 '09 at 19:17
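As a sketch of what that comment describes, the snippet below lets a `StreamWriter` write to an in-memory stream with a BOM-enabled encoding; the writer emits the preamble itself when it starts at the beginning of the stream. The names `WriterPreambleDemo` and `WriteWithWriter` are hypothetical, chosen for this example only.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriterPreambleDemo
{
    // Hypothetical helper: the StreamWriter, not the caller, writes the BOM.
    public static byte[] WriteWithWriter(string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
        {
            writer.Write(text);
        } // disposing the writer flushes it and closes the stream

        return stream.ToArray(); // ToArray still works on a closed MemoryStream
    }

    public static void Main()
    {
        byte[] data = WriteWithWriter("a");
        Console.WriteLine(BitConverter.ToString(data)); // EF-BB-BF-61
    }
}
```

Passing `new UTF8Encoding(false)` instead should give just the encoded text, since that encoding's preamble is empty.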

score 9 · Answer 2 · answered Jan 07 '09 at 16:28

Thank you both. The following works, and LINQ makes the combination simple :-)

// Requires: using System.Linq; using System.Text;
UTF8Encoding enc = new UTF8Encoding(true);
byte[] data = enc.GetBytes("a");
byte[] combo = enc.GetPreamble().Concat(data).ToArray(); // BOM followed by the encoded text

xyz

This is exactly what I'm doing. Note that `Encoding.UTF8` is a shorthand for `new UTF8Encoding(true)`, so your first line could be just `var enc = Encoding.UTF8;`, or in-line it to the other two, or even shrink the whole thing to a one-liner `var combo = Encoding.UTF8.GetPreamble().Concat(Encoding.UTF8.GetBytes("a")).ToArray();` Cheers. – Daniel Liuzzi Feb 25 '11 at 08:13

score 3 · Answer 3 · answered Jan 07 '09 at 16:07

Because it is expected that GetBytes() will be called lots of times... you need to use:

byte[] preamble = enc.GetPreamble();

(only call it at the start of a sequence) and write that; this is where the BOM lives.

Marc Gravell

score 2 · Answer 4 · answered Jan 09 '09 at 14:07

Note that in general, you don't need the Byte Order Mark for UTF-8 anyway. Its main purpose is to tell UTF-16 BE and UTF-16 LE apart. There is no such thing as UTF-8 LE or UTF-8 BE.

MSalters

It also allows you to differentiate UTF-8 files from ANSI files. – Ishmael Jan 23 '09 at 19:15
Even Microsoft admits "ANSI" is a confusing name - even when it's used to describe a charset. "ANSI files" don't exist anyway; on Windows all files are binary (mainframes did have true text files, but they didn't have "Microsoft ANSI"). – MSalters Feb 03 '09 at 14:34
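For the point raised in these comments about telling UTF-8 data apart from legacy code-page data, a minimal sketch of a BOM check might look like the following. The names `BomSniffDemo` and `StartsWithUtf8Bom` are hypothetical. Note the assumption baked in: a missing BOM does not prove the data is not UTF-8, so this is only a hint, not a full detector.

```csharp
using System;
using System.Text;

public static class BomSniffDemo
{
    // Hypothetical helper: true if the buffer begins with the UTF-8 preamble (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = new UTF8Encoding(true).GetPreamble();
        if (data.Length < bom.Length)
        {
            return false;
        }
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x61 };
        byte[] withoutBom = { 0x61 };
        Console.WriteLine(StartsWithUtf8Bom(withBom));    // True
        Console.WriteLine(StartsWithUtf8Bom(withoutBom)); // False
    }
}
```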
