
Can anyone tell me how, using the Windows API from my program (Alaska Xbase++), I can check whether a TXT file is UTF (8, 16, etc.) or ANSI? Thanks

I searched the web and found nothing.

EngiDev
    You can't, you have to know. See https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/. – CodeCaster May 11 '23 at 14:45
  • There is no Windows API for that. Normally the application opens the file and looks at the first two bytes for a BOM (https://en.wikipedia.org/wiki/Byte_order_mark). If the BOM is missing, some trial detection must be done: you can examine the contents for byte values == 0x00 or > 0x7F. In the first case, 0x00 bytes at odd addresses suggest UTF-16BE or UTF-32, and 0x00 at even addresses suggests UTF-16LE. UTF-8 encodings are more difficult to detect. In any case it must be considered a trial detection, and can never be certain. Googling you can find many examples, but keep in mind that you can't deterministically detect the file encoding. Never. – Frankie_C May 11 '23 at 14:58
  • https://devblogs.microsoft.com/oldnewthing/20150223-00/?p=44613 but yeah, it's impossible to know for sure. Why are you working with files for which the encoding is unknown? – David Heffernan May 11 '23 at 15:00
  • Also, [There's No Such Thing As Plain Text by Dylan Beattie](https://youtu.be/hI-eAg3hlcM). A fun talk to watch, even when you aren't interested in learning about the topic. – IInspectable May 11 '23 at 15:08
  • It's not a Windows API, but there is also ICU's ucsdet.h (character set detection library), which in conjunction with ucnv.h (conversion library) can be used to handle a much wider set of potential character sets. – SoronelHaetir May 11 '23 at 16:45
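
A minimal sketch of the BOM-plus-byte-pattern approach described in the comments above, in C++ for illustration (the question is about Xbase++, but the logic is language-agnostic; `GuessEncoding` is a hypothetical helper, not a Windows API, and as the comments stress the result is only ever a guess):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Guess a text sample's encoding: explicit BOMs first, then the positions of
// 0x00 bytes, then whether any bytes fall outside the ASCII range.
std::string GuessEncoding(const std::vector<uint8_t>& b)
{
    // 1. Explicit BOMs are the only near-reliable signal.
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 (BOM)";
    if (b.size() >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE (BOM)";
    if (b.size() >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE (BOM)";
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE (BOM)";
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE (BOM)";

    // 2. No BOM: look at where 0x00 bytes fall. For mostly-ASCII text stored
    //    as UTF-16, the zero half of each code unit lands at odd offsets for
    //    little-endian and at even offsets for big-endian (0-based offsets).
    std::size_t zerosEven = 0, zerosOdd = 0, highBytes = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        if (b[i] == 0x00) ((i % 2 == 0) ? zerosEven : zerosOdd)++;
        else if (b[i] > 0x7F) ++highBytes;
    }
    if (zerosOdd  > 0 && zerosEven == 0) return "probably UTF-16LE";
    if (zerosEven > 0 && zerosOdd  == 0) return "probably UTF-16BE";
    if (zerosEven > 0 || zerosOdd  > 0)  return "probably binary, not text";
    if (highBytes == 0)                  return "plain ASCII (valid as UTF-8 or ANSI)";
    return "UTF-8 or some ANSI codepage"; // needs further heuristics
}
```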

2 Answers


Unless the file has an explicit byte order mark (BOM), there's no way of knowing what encoding any given sequence of bytes assumes (and a BOM only applies to UTF, not "ANSI"). The best you can do at that point is guess, by running a few heuristics against the input.

To my knowledge, there are only two APIs in the system that can help you with the guessing part:

  • IsTextUnicode: A simple binary classifier that produces "looks like Unicode" and "doesn't look like Unicode" results (where "Unicode" means "UTF-16", essentially).
  • IMultiLanguage2::DetectCodepageInIStream: A more elaborate guesser, capable of detecting a wider range of codepages (even several candidates at once), each with a corresponding confidence level.
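
For the first of these, a minimal C++ sketch might look like this (the file name and buffer size are arbitrary; `IsTextUnicode` is declared via windows.h and exported from Advapi32):

```cpp
#include <windows.h>
#include <cstdio>

// Read the start of a file and ask IsTextUnicode whether it looks like UTF-16.
// The answer is a statistical guess, not a guarantee.
int main()
{
    HANDLE file = CreateFileW(L"sample.txt", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    BYTE buffer[4096];
    DWORD bytesRead = 0;
    ReadFile(file, buffer, sizeof(buffer), &bytesRead, nullptr);
    CloseHandle(file);

    // Passing nullptr for the last argument runs all of the available tests;
    // pass an INT holding IS_TEXT_UNICODE_* flags to select or inspect them.
    BOOL looksLikeUtf16 = IsTextUnicode(buffer, static_cast<int>(bytesRead), nullptr);
    std::printf("%s\n", looksLikeUtf16 ? "looks like UTF-16" : "does not look like UTF-16");
    return 0;
}
```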

As is the norm, Raymond Chen has covered this issue in his blog entry Further adventures in trying to guess what encoding a file is in.
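
And a rough sketch of the MLang route, again in C++ (error handling trimmed; you would typically link against shlwapi.lib, ole32.lib and uuid.lib, and "sample.txt" is an arbitrary placeholder):

```cpp
#include <windows.h>
#include <shlwapi.h>   // SHCreateStreamOnFileEx
#include <mlang.h>     // IMultiLanguage2, DetectEncodingInfo
#include <cstdio>

// Ask MLang to guess the codepage(s) of a file; each guess comes with a
// confidence value, which makes the "this is only a guess" nature explicit.
int main()
{
    if (FAILED(CoInitialize(nullptr))) return 1;

    IStream* stream = nullptr;
    if (SUCCEEDED(SHCreateStreamOnFileEx(L"sample.txt", STGM_READ, FILE_ATTRIBUTE_NORMAL,
                                         FALSE, nullptr, &stream)))
    {
        IMultiLanguage2* mlang = nullptr;
        if (SUCCEEDED(CoCreateInstance(CLSID_CMultiLanguage, nullptr, CLSCTX_INPROC_SERVER,
                                       IID_PPV_ARGS(&mlang))))
        {
            DetectEncodingInfo info[8] = {};
            INT count = ARRAYSIZE(info);   // in: array capacity, out: number of guesses
            if (SUCCEEDED(mlang->DetectCodepageInIStream(MLDETECTCP_NONE, 0, stream,
                                                         info, &count)))
            {
                for (INT i = 0; i < count; ++i)
                    std::printf("codepage %u, confidence %d\n",
                                info[i].nCodePage, info[i].nConfidence);
            }
            mlang->Release();
        }
        stream->Release();
    }
    CoUninitialize();
    return 0;
}
```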

IInspectable
    Actually, Raymond has some additional code to somewhat not abandon all hope in the no-BOM UTF-8 vs ANSI battle (I think reasonably recent Notepad versions use it): https://devblogs.microsoft.com/oldnewthing/20190701-00/?p=102636 – Simon Mourier May 11 '23 at 15:52
  • @SimonMourier Yes, sure. That's another way to guess, another heuristic to choose from. It's still down to guessing, and it's not exposed as a system service. Which the question is asking for. – IInspectable May 11 '23 at 16:14
  • IIRC, `IsTextUnicode` is what Notepad uses to infer the encoding. It does inspect for a BOM header and uses heuristics to guess otherwise. So if you want to be as good as Notepad, it's probably not a bad way to go. DetectCodepageInIStream is probably good too. – selbie May 11 '23 at 17:09

I wrote my own heuristic based function that guesses the encoding from among a set that was important for my purposes: plain ASCII, UTF-8, UTF-16LE, UTF-16BE, ISO 8859-1, Windows 1252, MacOS Roman, IBM437, and DEC's multinational character set. It'll also guess if it's not actually text but rather a binary file.

I came up with a list of "features" that could help distinguish between those encodings. For example, the presence of a UTF-16LE BOM is a feature. For each possible (feature, encoding) pair, I assigned a signed weight value. The code scans the bytes of the file (usually a smallish sample is fine) to detect the features. Then it tallies up the feature weights for each encoding and chooses the encoding with the highest tally.
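
The answer doesn't include code, but the tallying idea might look roughly like this toy C++ sketch (the types, names, and per-encoding weight map are invented for illustration and are not the answerer's actual implementation):

```cpp
#include <climits>
#include <map>
#include <string>
#include <vector>

// One detectable "feature" (e.g. "has a UTF-16LE BOM", "contains bytes
// illegal in UTF-8") with a signed weight for each candidate encoding.
struct Feature {
    bool detected = false;
    std::map<std::string, int> weights;
};

// Sum the weights of every detected feature per encoding; highest tally wins.
std::string PickEncoding(const std::vector<Feature>& features,
                         const std::vector<std::string>& candidates)
{
    std::string best;
    int bestScore = INT_MIN;
    for (const std::string& enc : candidates) {
        int score = 0;
        for (const Feature& f : features) {
            if (!f.detected) continue;
            auto it = f.weights.find(enc);
            if (it != f.weights.end()) score += it->second;
        }
        if (score > bestScore) { bestScore = score; best = enc; }
    }
    return best;
}
```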

Scanning is very cheap. A custom state machine examines the first few bytes for the various BOMs, and then it creates a histogram of the various byte values. Nearly all of the features can be detected by looking at subsets of the histogram.
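
A sketch of what such a histogram pass could look like, with a few example features that reduce to sums over histogram entries (again illustrative C++, not the answerer's code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Histogram = std::array<std::size_t, 256>;

// One pass over the sample; everything below works off the counts.
Histogram BuildHistogram(const std::vector<uint8_t>& sample)
{
    Histogram h{};
    for (uint8_t b : sample) ++h[b];
    return h;
}

// NUL bytes: common in UTF-16/UTF-32 and binary files, absent from 8-bit text.
std::size_t CountNulBytes(const Histogram& h) { return h[0x00]; }

// 0xC0, 0xC1, and 0xF5..0xFF can never appear in well-formed UTF-8.
std::size_t CountUtf8IllegalBytes(const Histogram& h)
{
    std::size_t n = h[0xC0] + h[0xC1];
    for (int b = 0xF5; b <= 0xFF; ++b) n += h[b];
    return n;
}

// 0x80..0x9F: mostly printable punctuation in Windows-1252, but C1 control
// codes in ISO 8859-1 -- useful for telling those two apart.
std::size_t CountC1Range(const Histogram& h)
{
    return std::accumulate(h.begin() + 0x80, h.begin() + 0xA0, std::size_t{0});
}
```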

For distinguishing UTF-8 from the others, the most useful features (other than a BOM) are (1) whether the number of leading bytes predicts the number of continuation bytes and (2) whether there are any byte values that are not legal in UTF-8.
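
The structural check behind feature (1) can also be done with a short sequential scan, something like the following simplified C++ (it ignores overlong sequences and surrogate ranges, and `LooksLikeUtf8` is an illustrative name, not the answerer's function):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if every lead byte is followed by exactly the number of continuation
// bytes (10xxxxxx) that its high bits promise.
bool LooksLikeUtf8(const std::vector<uint8_t>& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        uint8_t b = bytes[i++];
        std::size_t continuations;
        if      (b <= 0x7F)          continuations = 0;  // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) continuations = 1;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) continuations = 2;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) continuations = 3;  // 11110xxx
        else return false;  // stray continuation byte or value never valid as a lead byte

        while (continuations--) {
            if (i >= bytes.size()) return true;            // sample ended mid-character; don't penalize
            if ((bytes[i++] & 0xC0) != 0x80) return false; // expected 10xxxxxx
        }
    }
    return true;
}
```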

Most of the rest of the features are for distinguishing between sometimes subtle differences because there's a lot of overlap among encodings like ISO 8859-1, Win1252, and DEC MCS. I'm not sure how well this technique would hold up if you were trying to identify specific code pages from a much broader set.

Adrian McCarthy