
I'm looking for the correct syntax to remove the BOM from a UTF-16 text file. I have successfully done it for UTF-8. Please see below for the syntax I have tried:

$readline =~ s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
$readline =~ s/^\N{BYTE ORDER MARK}//;
$readline =~ s/^\N{BOM}//;
$readline =~ s/^\x{FEFF}//;
$readline =~ s/^\0x{FEFF}//;
$readline =~ s/^\x{FE}\x{FF}//;
$readline =~ s/^\xFE\xFF//;
$readline =~ s/^\0xFE\0xFF//;

As you can see, these are repetitive, but I was trying anything I could find. To open the file I used the encoding function. Any help would be greatly appreciated.

tchrist

1 Answer


What's in $readline?

If you have UTF-16be,

s/^\xFE\xFF//

If you have UTF-16le,

s/^\xFF\xFE//

If you have Unicode Code Points (decoded text),

s/^\x{FEFF}//
s/^\N{BOM}//
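
As a minimal sketch of the three cases above, the following uses made-up sample data (the string "Hi" with a leading BOM); the variable names are illustrative only and do not come from the question. It assumes the core Encode module for the decoding step.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: "Hi" as UTF-16 bytes, each with a leading BOM.
my $be_bytes = "\xFE\xFF" . "\x00H\x00i";   # big-endian byte order
my $le_bytes = "\xFF\xFE" . "H\x00i\x00";   # little-endian byte order

# Undecoded bytes: the BOM is two bytes, matched in the order they appear.
(my $be_stripped = $be_bytes) =~ s/^\xFE\xFF//;
(my $le_stripped = $le_bytes) =~ s/^\xFF\xFE//;

# Decoded text: the BOM is the single code point U+FEFF.
# The explicit-endian decoder leaves U+FEFF in the decoded string.
my $text = decode('UTF-16BE', $be_bytes);
$text =~ s/^\x{FEFF}//;

printf "%d %d %s\n", length $be_stripped, length $le_stripped, $text;
```

Which substitution applies depends only on whether `$readline` holds raw bytes or already-decoded characters, which is why the answer starts by asking what is in it.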

Alternatively, you can use File::BOM to both remove the mark and decode the stream.
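
File::BOM is a CPAN module, so it may not be installed. As a core-only sketch of the same idea (not File::BOM itself), the generic `UTF-16` decoder in Encode uses the BOM to pick the byte order and leaves it out of the result. The sample bytes and the file name below are placeholders.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: "Hi" in little-endian UTF-16 with a leading BOM.
my $bytes = "\xFF\xFE" . "H\x00i\x00";

# The generic 'UTF-16' decoder reads the BOM to determine the byte order
# and does not include it in the decoded text.
my $text = decode('UTF-16', $bytes);

# When reading from a file, the same decoder is available as an I/O layer
# (the path is a placeholder):
# open(my $fh, '<:encoding(UTF-16)', 'input.txt') or die "Cannot open: $!";
```

With this approach no substitution is needed afterwards, because the decoded text never contains the mark.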

ikegami
  • +1, though if I may be a bit pedantic, I think that "UTF-16BE" and "UTF-16LE" are really supposed to refer to the two versions of UTF-16 that do *not* allow a byte order mark. The UTF-16 that *does* allow a BOM, and can therefore have either endianness (defaulting to BE if there's no BOM), is supposed to be called just "UTF-16". (See http://unicode.org/faq/utf_bom.html#gen7.) – ruakh Feb 26 '17 at 23:28
  • I agree that "UTF-16 that's encoded in little-endian byte order" is too long-winded, but I don't know why you suggest that as the alternative. What's wrong with "little-endian UTF-16"? – ruakh Feb 26 '17 at 23:49
  • Actually, nothing in the passage you linked says that UTF-16le and UTF-16be can't have a BOM. (It just says that UTF-16 may have one.) – ikegami Feb 26 '17 at 23:54
  • Right, because that's just a convenient FAQ, not a formal spec. For the latter, see [RFC 2781](https://tools.ietf.org/html/rfc2781), which specifies that "Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text" and so on. – ruakh Feb 27 '17 at 00:01
  • That passage explains what it means when an application says it expects "UTF-16le" or "UTF-16be". It's not relevant here. Read on to 4.1. A BOM can still lead UTF-16le and UTF-16be. It just doesn't affect how the stream is processed. – ikegami Feb 27 '17 at 00:11
  • In other words, if I say I require UTF-16le, and you produce something starting with `FE FF`, it's still going to be treated as UTF-16le. The usage in my answer is therefore in line with the RFC. Technically, it's the OP who is in error by calling it a BOM instead of a ZWNBS. – ikegami Feb 27 '17 at 00:38
  • I think you have that backward. It's generally the producer, not the consumer, that specifies the character-set (e.g. as part of an HTTP `Content-Type` header). So if I send a document marked as `UTF-16LE` or `UTF-16BE`, then the RFC is saying that I *must not* include a BOM. – ruakh Feb 27 '17 at 01:06
  • Makes no diff. Like I said, if it's not a BOM, then it's a ZWNBS, and the answer remains the same. – ikegami Feb 27 '17 at 20:59
  • But then how did you decide that it's not a BOM? Nothing in the original question suggests that the OP is using "UTF-16BE" or "UTF-16LE". – ruakh Feb 27 '17 at 22:05
  • @ruakh, The question isn't encoding detection. The OP didn't ask to detect if they have UTF-16le + BOM or UTF-16be + BOM, so they must therefore know which one they have. In fact, the OP tagged the question UTF-16le (subsequently removed by tchrist). If they need BOM-based encoding detection, they should use File::BOM as I mentioned, or just `open(my $fh, '<:encoding(UTF-16)', ...)` – ikegami Feb 27 '17 at 22:14
  • Ah, OK. Fair enough. – ruakh Feb 27 '17 at 22:48