9

Is it possible to know if a file has Unicode (16-bit per char) or 8-bit ASCII content?

Peter Mortensen
Franck Freiburger

8 Answers

10

You may be able to read a byte-order-mark, if the file has this present.
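A minimal sketch of such a check in C (the function name and returned labels are illustrative, not from any particular library):

```c
#include <stddef.h>
#include <string.h>

/* Return a label for a recognized byte-order mark at the start of buf,
   or NULL if none is present. Illustrative helper, not a library API. */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";  /* note: a UTF-32LE BOM starts the same way */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    return NULL;  /* no BOM -- the file may still be in any encoding */
}
```

Keep in mind that a BOM is optional, so its absence proves nothing about the encoding.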

Brian Agnew
5

UTF-16 characters are all at least 16 bits, with some taking 32 bits as surrogate pairs (lead units in the range 0xD800 to 0xDBFF). So simply scanning each byte to see if it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but a single character, 0x2020 (dagger), in UTF-16. If the text is known to be English with the occasional non-ASCII character, then almost every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.

Greg Young
4

Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.

You can also tell whether it is ASCII by scanning every byte in the file and checking that each is less than 128. If they all are, then it's just an ASCII file. If any byte is 128 or greater, there is some other encoding in there.
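That scan is only a few lines; a sketch in C (the function name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* True when every byte is below 128, i.e. the buffer is 7-bit ASCII. */
static bool is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 128)
            return false;  /* high bit set: not plain ASCII */
    return true;
}
```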

Brian Agnew
David Grayson
2

First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.

The various "common" character sets such as ISO-8859-x, Windows-1252, etc., are 8-bit, so if every other byte is 0, you know that you're dealing with UTF-16 text that only uses characters from the ISO-8859 range.

You'll run into problems where you're trying to distinguish between such a single-byte encoding and UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.


Edit in response to OP's comment:

I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).

My one caveat is that you should carefully read the documentation to ensure that its use of the word "Unicode" is correct (I looked at this page only enough to understand the function, and did not look any further).
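A sketch of that NUL-byte test in C, assuming the script is already in a memory buffer (the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Any 0x00 byte strongly suggests a 16-bit encoding, since the ASCII
   subset of UTF-16 always carries a zero high byte. */
static bool looks_like_utf16(const void *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```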

kdgregory
  • I have to choose between JS_CompileScript() and JS_CompileUCScript() to compile JavaScript files for my native embedding (http://code.google.com/p/jslibs) – Franck Freiburger Nov 21 '09 at 15:38
1

If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in UTF-16 or English text in ASCII), you can do a simple frequency analysis on the bytes and see whether the distribution looks like that of ASCII or of UTF-16.
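As a crude instance of such an analysis: UTF-16-encoded English text has roughly half its bytes equal to zero, while ASCII text has none. The 40% threshold below is an arbitrary assumption, not a standard value:

```c
#include <stdbool.h>
#include <stddef.h>

/* Crude frequency heuristic: treat the buffer as UTF-16 English text
   when at least 40% of its bytes are zero (threshold chosen arbitrarily). */
static bool probably_utf16_english(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;
    return len > 0 && zeros * 10 >= len * 4;
}
```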

Pascal Cuoq
1

Unicode is an alphabet, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.

dottedmag
  • Unfortunately Microsoft have really confused this issue by consistently calling the UTF-16LE encoding “Unicode”. – bobince Nov 21 '09 at 15:27
  • Unicode is not an alphabet. It is an encoding, which encodes many alphabets. Think of it as a mapping from alphabets to a representation of those alphabets in digital form. – Victor Engel Jun 03 '17 at 18:37
  • Unicode is neither an alphabet nor an encoding, but a coded character set, offering multiple character encodings (UTF-8, UTF-16 and UTF-32). – Gustaf Liljegren Aug 23 '17 at 10:00
  • Shall I disagree one more time? It's not an alphabet, encoding, or a coded character set as ISO/IEC 10646 is, but a standard for encoding, handling, and representation of writing systems. In addition to the character set, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#Differences_from_Unicode – Victor Engel Dec 05 '17 at 00:21
1

To programmatically discern the type of a file -- including, but not limited to, the encoding -- the best bet is to use libmagic. BSD-licensed, it is part of just about every Unix system you are likely to encounter, and for the lesser ones you can bundle it with your application.

Detecting the MIME type from C, for example, is as simple as:

magic_t Magic = magic_open(MAGIC_MIME|MAGIC_ERROR);
magic_load(Magic, NULL);  /* NULL loads the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);

Other languages have their own modules wrapping this library.

Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):

% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
Mikhail T.
0

For your specific use case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript source is bound to contain ASCII characters, and in UTF-16 each of those is represented with a zero high byte alongside the ASCII byte.

ZZ Coder