Remove UTF-16 BOM from string with Perl

Question

I'm looking for the correct syntax to remove the BOM from a UTF-16 text file. I have successfully done it for UTF-8. Please see below for the syntax I have tried:

$readline =~ s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
$readline =~ s/^\N{BYTE ORDER MARK}//;
$readline =~ s/^\N{BOM}//;
$readline =~ s/^\x{FEFF}//;
$readline =~ s/^\0x{FEFF}//;
$readline =~ s/^\x{FE}\x{FF}//;
$readline =~ s/^\xFE\xFF//;
$readline =~ s/^\0xFE\0xFF//;

As you can see, these are repetitive, but I was trying anything I could find. To open the file I used the encoding function. Any help would be greatly appreciated.

Possible duplicate of [Remove BOM from string with Perl](http://stackoverflow.com/questions/24390034/remove-bom-from-string-with-perl) — DavidO, Feb 24 '17 at 23:43
Also http://www.larshaendler.com/2014/03/14/remove-bom-while-reading-file-with-perl/ — DavidO, Feb 24 '17 at 23:46
This cannot be answered unless you tell us whether you have bytes or characters first. — tchrist, Feb 26 '17 at 23:19

score 5 · Answer 1 · answered Feb 25 '17 at 00:02

What's in $readline?

If you have UTF-16be,

s/^\xFE\xFF//

If you have UTF-16le,

s/^\xFF\xFE//

If you have Unicode Code Points (decoded text),

s/^\x{FEFF}//
s/^\N{BOM}//
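To make the three cases concrete, here is a minimal sketch that builds the same sample text in each form and strips the mark with the matching substitution. The sample string and variable names are made up for illustration, and it assumes you already know which form `$readline` is in.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample data: the same text in three forms.
my $text     = "\x{FEFF}hello";            # decoded characters, U+FEFF first
my $bytes_be = encode('UTF-16BE', $text);  # raw bytes starting FE FF
my $bytes_le = encode('UTF-16LE', $text);  # raw bytes starting FF FE

# Raw big-endian bytes: the mark is the two bytes FE FF.
(my $be = $bytes_be) =~ s/^\xFE\xFF//;

# Raw little-endian bytes: the mark is the two bytes FF FE.
(my $le = $bytes_le) =~ s/^\xFF\xFE//;

# Decoded text: the mark is the single code point U+FEFF.
(my $chars = $text) =~ s/^\x{FEFF}//;

# Each line prints: hello
print decode('UTF-16BE', $be), "\n";
print decode('UTF-16LE', $le), "\n";
print $chars, "\n";
```

If none of the byte patterns match, the string has most likely been decoded already, and only the `\x{FEFF}` form will match.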

Alternatively, you can also use File::BOM to both remove the mark and decode the stream.
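If you would rather stay with core modules, Encode can do something similar when the data is already in memory as bytes. The sketch below is only an illustration with a made-up input string: decoding as plain `UTF-16` uses the mark to pick the byte order and drops it, while decoding as `UTF-16LE` keeps it as U+FEFF, so it still has to be removed afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "hi" as little-endian UTF-16 bytes with a leading BOM.
my $bytes = "\xFF\xFE" . "h\x00" . "i\x00";

# 'UTF-16' (no LE/BE suffix) reads the BOM to choose the byte order
# and does not return it as part of the decoded text.
my $text = decode('UTF-16', $bytes);

# 'UTF-16LE' takes the byte order as given, so the mark survives as
# U+FEFF and is stripped from the decoded characters afterwards.
my $text_le = decode('UTF-16LE', $bytes);
$text_le =~ s/^\x{FEFF}//;

# Both lines print: hi
print "$text\n";
print "$text_le\n";
```

Note that Encode's plain `UTF-16` decoder expects a BOM to be present, so this route only fits input that is known to start with one.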

ikegami

+1, though if I may be a bit pedantic, I think that "UTF-16BE" and "UTF-16LE" are really supposed to refer to the two versions of UTF-16 that do *not* allow a byte order mark. The UTF-16 that *does* allow a BOM, and can therefore have either endianness (defaulting to BE if there's no BOM), is supposed to be called just "UTF-16". (See http://unicode.org/faq/utf_bom.html#gen7.) – ruakh Feb 26 '17 at 23:28
I agree that "UTF-16 that's encoded in little-endian byte order" is too long-winded, but I don't know why you suggest that as the alternative. What's wrong with "little-endian UTF-16"? – ruakh Feb 26 '17 at 23:49
Actually, nothing in the passage you linked says that UTF-16le and UTF-16be can't have a BOM. (It just says that UTF-16 may have one.) – ikegami Feb 26 '17 at 23:54
Right, because that's just a convenient FAQ, not a formal spec. For the latter, see [RFC 2781](https://tools.ietf.org/html/rfc2781), which specifies that "Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text" and so on. – ruakh Feb 27 '17 at 00:01
That passage explains what it means when an application says it expects "UTF-16le" or "UTF-16be". It's not relevant here. Read on to 4.1. A BOM can still lead UTF-16le and UTF-16be. It just doesn't affect how the stream is processed. – ikegami Feb 27 '17 at 00:11
In other words, if I say I require UTF-16le, and you produce something starting with `FE FF`, it's still going to be treated as UTF-16le. The usage in my answer is therefore in line with the RFC. Technically, it's the OP who is in error by calling it a BOM instead of a ZWNBS. – ikegami Feb 27 '17 at 00:38
I think you have that backward. It's generally the producer, not the consumer, that specifies the character set (e.g. as part of an HTTP `Content-Type` header). So if I send a document marked as `UTF-16LE` or `UTF-16BE`, then the RFC is saying that I *must not* include a BOM. – ruakh Feb 27 '17 at 01:06
Makes no diff. Like I said, if it's not a BOM, then it's a ZWNBS, and the answer remains the same. – ikegami Feb 27 '17 at 20:59
But then how did you decide that it's not a BOM? Nothing in the original question suggests that the OP is using "UTF-16BE" or "UTF-16LE". – ruakh Feb 27 '17 at 22:05
@ruakh, The question isn't encoding detection. The OP didn't ask to detect if they have UTF-16le + BOM or UTF-16be + BOM, so they must therefore know which one they have. In fact, the OP tagged the question UTF-16le (subsequently removed by tchrist). If they need BOM-based encoding detection, they should use File::BOM as I mentioned, or just `open(my $fh, '<:encoding(UTF-16)', ...)` – ikegami Feb 27 '17 at 22:14
Ah, OK. Fair enough. – ruakh Feb 27 '17 at 22:48
