9

Is it possible to know if a file has Unicode (16-bit per char) or 8-bit ASCII content?

Peter Mortensen
Franck Freiburger

8 Answers

10

You may be able to read a byte-order-mark, if the file has this present.
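A minimal sketch of such a check in C (the function name and returned labels are illustrative, not from any particular library):

```c
#include <stddef.h>
#include <string.h>

/* Return a label for a recognized byte-order mark at the start of buf,
   or NULL if none is present. Illustrative helper, not a library API. */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";  /* note: a UTF-32LE BOM starts the same way */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    return NULL;  /* no BOM -- the file may still be in any encoding */
}
```

Keep in mind that a BOM is optional, so its absence proves nothing about the encoding.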

Brian Agnew
5

UTF-16 characters are all at least 16 bits, with some taking 32 bits as surrogate pairs (lead units in the range 0xD800 to 0xDBFF). So simply scanning each byte to see if it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but a single character, 0x2020 (dagger), in UTF-16. If the text is known to be English with the occasional non-ASCII character, then almost every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.

Greg Young
4

Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.

You can also tell whether it is ASCII by scanning every byte in the file and checking that each is less than 128. If they all are, then it's just an ASCII file. If any byte is 128 or greater, there is some other encoding in there.
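That scan is only a few lines; a sketch in C (the function name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* True when every byte is below 128, i.e. the buffer is 7-bit ASCII. */
static bool is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 128)
            return false;  /* high bit set: not plain ASCII */
    return true;
}
```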

Brian Agnew
David Grayson
2

First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.

The various "common" character sets such as ISO-8859-x, Windows-1252, etc., are 8-bit, so if every other byte is 0, you know that you're dealing with UTF-16 text that only uses characters from the ISO-8859 range.

You'll run into problems where you're trying to distinguish between such a single-byte encoding and UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.


Edit in response to OP's comment:

I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).

My one caveat is that you should carefully read the documentation to ensure that its use of the word "Unicode" is correct (I looked at this page only enough to understand the function, and did not look any further).
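A sketch of that NUL-byte test in C, assuming the script is already in a memory buffer (the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Any 0x00 byte strongly suggests a 16-bit encoding, since the ASCII
   subset of UTF-16 always carries a zero high byte. */
static bool looks_like_utf16(const void *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```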

kdgregory
  • I have to choose between JS_CompileScript() and JS_CompileUCScript() to compile JavaScript files for my native embedding (http://code.google.com/p/jslibs) – Franck Freiburger Nov 21 '09 at 15:38
1

If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in UTF-16 or English text in ASCII), you can do a simple frequency analysis on the bytes and see whether the distribution looks like that of ASCII or of UTF-16.
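As a crude instance of such an analysis: UTF-16-encoded English text has roughly half its bytes equal to zero, while ASCII text has none. The 40% threshold below is an arbitrary assumption, not a standard value:

```c
#include <stdbool.h>
#include <stddef.h>

/* Crude frequency heuristic: treat the buffer as UTF-16 English text
   when at least 40% of its bytes are zero (threshold chosen arbitrarily). */
static bool probably_utf16_english(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;
    return len > 0 && zeros * 10 >= len * 4;
}
```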

Pascal Cuoq
1

Unicode is an alphabet, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.

dottedmag
  • Unfortunately Microsoft have really confused this issue by consistently calling the UTF-16LE encoding “Unicode”. – bobince Nov 21 '09 at 15:27
  • Unicode is not an alphabet. It is an encoding, which encodes many alphabets. Think of it as a mapping from alphabets to a representation of those alphabets in digital form. – Victor Engel Jun 03 '17 at 18:37
  • Unicode is neither an alphabet nor an encoding, but a coded character set, offering multiple character encodings (UTF-8, UTF-16 and UTF-32). – Gustaf Liljegren Aug 23 '17 at 10:00
  • Shall I disagree one more time? It's not an alphabet, encoding, or a coded character set as ISO/IEC 10646 is, but a standard for encoding, handling, and representation of writing systems. In addition to the character set, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#Differences_from_Unicode – Victor Engel Dec 05 '17 at 00:21
1

To programmatically discern the type of a file -- including, but not limited to, the encoding -- the best bet is to use libmagic. BSD-licensed, it is part of just about every Unix system you are likely to encounter, and for the lesser ones you can bundle it with your application.

Detecting the MIME type from C, for example, is as simple as:

magic_t Magic = magic_open(MAGIC_MIME|MAGIC_ERROR);
magic_load(Magic, NULL);  /* NULL loads the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);

Other languages have their own modules wrapping this library.

Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):

% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
Mikhail T.
0

For your specific use case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript source is bound to contain ASCII characters, and in UTF-16 each of those is represented with a zero high byte alongside the ASCII byte.

ZZ Coder