
Can anyone tell me how, using the Windows API from my program (Alaska Xbase++), I can check whether a TXT file is UTF (8, 16, etc.) or ANSI? Thanks

I searched the web and found nothing.

EngiDev
    You can't, you have to know. See https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/. – CodeCaster May 11 '23 at 14:45
  • There is no Windows API for that. Normally the application opens the file and looks at the first two bytes for a BOM (https://en.wikipedia.org/wiki/Byte_order_mark). If the BOM is missing, some trial detection must be done: you can examine the contents for byte values == 0x00 or > 0x7F. In the first case, 0x00 bytes at odd addresses suggest UTF-16BE or UTF-32, and 0x00 at even addresses suggests UTF-16LE. UTF-8 encodings are more difficult to detect. In any case it must be considered a trial detection, and can never be certain. Googling you can find many examples, but keep in mind that you can't deterministically detect the file encoding. Never. – Frankie_C May 11 '23 at 14:58
  • https://devblogs.microsoft.com/oldnewthing/20150223-00/?p=44613 but yeah, it's impossible to know for sure. Why are you working with files for which the encoding is unknown? – David Heffernan May 11 '23 at 15:00
  • Also, [There's No Such Thing As Plain Text by Dylan Beattie](https://youtu.be/hI-eAg3hlcM). A fun talk to watch, even when you aren't interested in learning about the topic. – IInspectable May 11 '23 at 15:08
  • It's not a Windows API, but there is also ICU's ucsdet.h (character set detection library), which in conjunction with ucnv.h (conversion library) can be used to handle a much wider set of potential character sets. – SoronelHaetir May 11 '23 at 16:45
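
A minimal sketch of the BOM-plus-byte-pattern approach described in the comments above, in C++ for illustration (the question is about Xbase++, but the logic is language-agnostic; `GuessEncoding` is a hypothetical helper, not a Windows API, and as the comments stress the result is only ever a guess):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Guess a text sample's encoding: explicit BOMs first, then the positions of
// 0x00 bytes, then whether any bytes fall outside the ASCII range.
std::string GuessEncoding(const std::vector<uint8_t>& b)
{
    // 1. Explicit BOMs are the only near-reliable signal.
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 (BOM)";
    if (b.size() >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE (BOM)";
    if (b.size() >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE (BOM)";
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE (BOM)";
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE (BOM)";

    // 2. No BOM: look at where 0x00 bytes fall. For mostly-ASCII text stored
    //    as UTF-16, the zero half of each code unit lands at odd offsets for
    //    little-endian and at even offsets for big-endian (0-based offsets).
    std::size_t zerosEven = 0, zerosOdd = 0, highBytes = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        if (b[i] == 0x00) ((i % 2 == 0) ? zerosEven : zerosOdd)++;
        else if (b[i] > 0x7F) ++highBytes;
    }
    if (zerosOdd  > 0 && zerosEven == 0) return "probably UTF-16LE";
    if (zerosEven > 0 && zerosOdd  == 0) return "probably UTF-16BE";
    if (zerosEven > 0 || zerosOdd  > 0)  return "probably binary, not text";
    if (highBytes == 0)                  return "plain ASCII (valid as UTF-8 or ANSI)";
    return "UTF-8 or some ANSI codepage"; // needs further heuristics
}
```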

2 Answers


Unless the file has an explicit byte order mark (BOM), there's no way of knowing what encoding any given sequence of bytes assumes (and a BOM only applies to UTF, not "ANSI"). The best you can do at that point is guess, by running a few heuristics against the input.

To my knowledge, there are only two APIs in the system that can help you with the guessing part:

  • IsTextUnicode: A simple binary classifier that produces "looks like Unicode" and "doesn't look like Unicode" results (where "Unicode" means "UTF-16", essentially).
  • IMultiLanguage2::DetectCodepageInIStream: A more elaborate guesser, capable of detecting a wider range of codepages (even several candidates at once), each with a corresponding confidence level.
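
For the first of these, a minimal C++ sketch might look like this (the file name and buffer size are arbitrary; `IsTextUnicode` is declared via windows.h and exported from Advapi32):

```cpp
#include <windows.h>
#include <cstdio>

// Read the start of a file and ask IsTextUnicode whether it looks like UTF-16.
// The answer is a statistical guess, not a guarantee.
int main()
{
    HANDLE file = CreateFileW(L"sample.txt", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    BYTE buffer[4096];
    DWORD bytesRead = 0;
    ReadFile(file, buffer, sizeof(buffer), &bytesRead, nullptr);
    CloseHandle(file);

    // Passing nullptr for the last argument runs all of the available tests;
    // pass an INT holding IS_TEXT_UNICODE_* flags to select or inspect them.
    BOOL looksLikeUtf16 = IsTextUnicode(buffer, static_cast<int>(bytesRead), nullptr);
    std::printf("%s\n", looksLikeUtf16 ? "looks like UTF-16" : "does not look like UTF-16");
    return 0;
}
```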

As is the norm, Raymond Chen has covered this issue in his blog entry Further adventures in trying to guess what encoding a file is in.
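
And a rough sketch of the MLang route, again in C++ (error handling trimmed; you would typically link against shlwapi.lib, ole32.lib and uuid.lib, and "sample.txt" is an arbitrary placeholder):

```cpp
#include <windows.h>
#include <shlwapi.h>   // SHCreateStreamOnFileEx
#include <mlang.h>     // IMultiLanguage2, DetectEncodingInfo
#include <cstdio>

// Ask MLang to guess the codepage(s) of a file; each guess comes with a
// confidence value, which makes the "this is only a guess" nature explicit.
int main()
{
    if (FAILED(CoInitialize(nullptr))) return 1;

    IStream* stream = nullptr;
    if (SUCCEEDED(SHCreateStreamOnFileEx(L"sample.txt", STGM_READ, FILE_ATTRIBUTE_NORMAL,
                                         FALSE, nullptr, &stream)))
    {
        IMultiLanguage2* mlang = nullptr;
        if (SUCCEEDED(CoCreateInstance(CLSID_CMultiLanguage, nullptr, CLSCTX_INPROC_SERVER,
                                       IID_PPV_ARGS(&mlang))))
        {
            DetectEncodingInfo info[8] = {};
            INT count = ARRAYSIZE(info);   // in: array capacity, out: number of guesses
            if (SUCCEEDED(mlang->DetectCodepageInIStream(MLDETECTCP_NONE, 0, stream,
                                                         info, &count)))
            {
                for (INT i = 0; i < count; ++i)
                    std::printf("codepage %u, confidence %d\n",
                                info[i].nCodePage, info[i].nConfidence);
            }
            mlang->Release();
        }
        stream->Release();
    }
    CoUninitialize();
    return 0;
}
```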

IInspectable
    Actually, Raymond has some additional code to somewhat not abandon all hope in the no-BOM UTF-8 vs ANSI battle (I think reasonably recent Notepad versions use it): https://devblogs.microsoft.com/oldnewthing/20190701-00/?p=102636 – Simon Mourier May 11 '23 at 15:52
  • @SimonMourier Yes, sure. That's another way to guess, another heuristic to choose from. It's still down to guessing, and it's not exposed as a system service. Which the question is asking for. – IInspectable May 11 '23 at 16:14
  • IIRC, `IsTextUnicode` is what Notepad uses to infer the encoding. It does inspect for a BOM header and uses heuristics to guess otherwise. So if you want to be as good as Notepad, it's probably not a bad way to go. DetectCodepageInIStream is probably good too. – selbie May 11 '23 at 17:09

I wrote my own heuristic based function that guesses the encoding from among a set that was important for my purposes: plain ASCII, UTF-8, UTF-16LE, UTF-16BE, ISO 8859-1, Windows 1252, MacOS Roman, IBM437, and DEC's multinational character set. It'll also guess if it's not actually text but rather a binary file.

I came up with a list of "features" that could help distinguish between those encodings. For example, the presence of a UTF-16LE BOM is a feature. For each possible (feature, encoding) pair, I assigned a signed weight value. The code scans the bytes of the file (usually a smallish sample is fine) to detect the features. Then it tallies up the feature weights for each encoding and chooses the encoding with the highest tally.
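
The answer doesn't include code, but the tallying idea might look roughly like this toy C++ sketch (the types, names, and per-encoding weight map are invented for illustration and are not the answerer's actual implementation):

```cpp
#include <climits>
#include <map>
#include <string>
#include <vector>

// One detectable "feature" (e.g. "has a UTF-16LE BOM", "contains bytes
// illegal in UTF-8") with a signed weight for each candidate encoding.
struct Feature {
    bool detected = false;
    std::map<std::string, int> weights;
};

// Sum the weights of every detected feature per encoding; highest tally wins.
std::string PickEncoding(const std::vector<Feature>& features,
                         const std::vector<std::string>& candidates)
{
    std::string best;
    int bestScore = INT_MIN;
    for (const std::string& enc : candidates) {
        int score = 0;
        for (const Feature& f : features) {
            if (!f.detected) continue;
            auto it = f.weights.find(enc);
            if (it != f.weights.end()) score += it->second;
        }
        if (score > bestScore) { bestScore = score; best = enc; }
    }
    return best;
}
```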

Scanning is very cheap. A custom state machine examines the first few bytes for the various BOMs, and then it creates a histogram of the various byte values. Nearly all of the features can be detected by looking at subsets of the histogram.
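
A sketch of what such a histogram pass could look like, with a few example features that reduce to sums over histogram entries (again illustrative C++, not the answerer's code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Histogram = std::array<std::size_t, 256>;

// One pass over the sample; everything below works off the counts.
Histogram BuildHistogram(const std::vector<uint8_t>& sample)
{
    Histogram h{};
    for (uint8_t b : sample) ++h[b];
    return h;
}

// NUL bytes: common in UTF-16/UTF-32 and binary files, absent from 8-bit text.
std::size_t CountNulBytes(const Histogram& h) { return h[0x00]; }

// 0xC0, 0xC1, and 0xF5..0xFF can never appear in well-formed UTF-8.
std::size_t CountUtf8IllegalBytes(const Histogram& h)
{
    std::size_t n = h[0xC0] + h[0xC1];
    for (int b = 0xF5; b <= 0xFF; ++b) n += h[b];
    return n;
}

// 0x80..0x9F: mostly printable punctuation in Windows-1252, but C1 control
// codes in ISO 8859-1 -- useful for telling those two apart.
std::size_t CountC1Range(const Histogram& h)
{
    return std::accumulate(h.begin() + 0x80, h.begin() + 0xA0, std::size_t{0});
}
```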

For distinguishing UTF-8 from the others, the most useful features (other than a BOM) are (1) whether the number of leading bytes predicts the number of continuation bytes and (2) whether there are any byte values that are not legal in UTF-8.
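
The structural check behind feature (1) can also be done with a short sequential scan, something like the following simplified C++ (it ignores overlong sequences and surrogate ranges, and `LooksLikeUtf8` is an illustrative name, not the answerer's function):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if every lead byte is followed by exactly the number of continuation
// bytes (10xxxxxx) that its high bits promise.
bool LooksLikeUtf8(const std::vector<uint8_t>& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        uint8_t b = bytes[i++];
        std::size_t continuations;
        if      (b <= 0x7F)          continuations = 0;  // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) continuations = 1;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) continuations = 2;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) continuations = 3;  // 11110xxx
        else return false;  // stray continuation byte or value never valid as a lead byte

        while (continuations--) {
            if (i >= bytes.size()) return true;            // sample ended mid-character; don't penalize
            if ((bytes[i++] & 0xC0) != 0x80) return false; // expected 10xxxxxx
        }
    }
    return true;
}
```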

Most of the rest of the features are for distinguishing between sometimes subtle differences because there's a lot of overlap among encodings like ISO 8859-1, Win1252, and DEC MCS. I'm not sure how well this technique would hold up if you were trying to identify specific code pages from a much broader set.

Adrian McCarthy