Can anyone tell me how, using the Windows API from my program (Alaska Xbase++), I can check whether a TXT file is UTF (8, 16, etc.) or ANSI? Thanks.
I searched the web and found nothing.
Unless the file has an explicit byte order mark (BOM), there's no way of knowing what encoding any given sequence of bytes assumes (and a BOM only applies to UTF, not "ANSI"). The best you can do at that point is guess, by running a few heuristics against the input.
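The BOM case is the only one you can settle with certainty, and it takes nothing more than looking at the first few bytes. Here is a minimal C++ sketch (the BOM byte sequences themselves are standard; the function name and return convention are just for illustration, and from Xbase++ you would simply perform the same comparisons on the bytes you read from the file):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Identify a Unicode encoding from its byte order mark, if one is present.
// Returns a short label, or nullptr when there is no BOM (in which case
// you're back to guessing). Note that UTF-32LE must be tested before
// UTF-16LE, because its BOM starts with the same two bytes.
const char *EncodingFromBom(const std::uint8_t *bytes, std::size_t length)
{
    if (length >= 4 && std::memcmp(bytes, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (length >= 4 && std::memcmp(bytes, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (length >= 3 && std::memcmp(bytes, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (length >= 2 && std::memcmp(bytes, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (length >= 2 && std::memcmp(bytes, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return nullptr; // no BOM: fall back to heuristics
}
```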
To my knowledge, there are only two APIs in the system that can help you with the guessing part:
- IsTextUnicode: a simple binary classifier that produces "looks like Unicode" and "doesn't look like Unicode" results (where "Unicode" essentially means "UTF-16"). See the sketch below for the call shape.
- IMultiLanguage2::DetectCodepageInIStream: a more elaborate guesser, capable of classifying many more codepages (even several candidates at once) and reporting a confidence level for each.

As is the norm, Raymond Chen has covered this issue in his blog entry Further adventures in trying to guess what encoding a file is in.
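To give an idea of the call shape, here is a minimal C++ sketch of the IsTextUnicode route (the file name and sample size are placeholders, and from Xbase++ you would reach the same function through its DLL-calling facilities). The MLang route is more involved, since IMultiLanguage2 is a COM interface that you first have to create and initialize.

```cpp
#include <windows.h>
#include <cstdio>

// Read a sample of the file and ask IsTextUnicode whether it looks like
// UTF-16. Link against Advapi32.lib. "sample.txt" is a placeholder path.
int main()
{
    HANDLE hFile = CreateFileW(L"sample.txt", GENERIC_READ, FILE_SHARE_READ,
                               nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (hFile == INVALID_HANDLE_VALUE) {
        std::fprintf(stderr, "cannot open file\n");
        return 1;
    }

    BYTE buffer[4096];            // a smallish sample is usually enough
    DWORD bytesRead = 0;
    ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, nullptr);
    CloseHandle(hFile);

    // Request all test groups; on return the mask says which tests passed.
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK | IS_TEXT_UNICODE_REVERSE_MASK |
                IS_TEXT_UNICODE_NOT_UNICODE_MASK | IS_TEXT_UNICODE_NOT_ASCII_MASK;
    BOOL looksUtf16 = IsTextUnicode(buffer, static_cast<int>(bytesRead), &tests);

    std::printf("%s\n", looksUtf16 ? "probably UTF-16"
                                   : "probably not UTF-16 (UTF-8 or ANSI)");
    return 0;
}
```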
I wrote my own heuristic-based function that guesses the encoding from among a set that was important for my purposes: plain ASCII, UTF-8, UTF-16LE, UTF-16BE, ISO 8859-1, Windows-1252, Mac OS Roman, IBM437, and DEC's Multinational Character Set. It will also guess when the input isn't actually text but a binary file.
I came up with a list of "features" that could help distinguish between those encodings. For example, the presence of a UTF-16LE BOM is a feature. For each possible (feature, encoding) pair, I assigned a signed weight value. The code scans the bytes of the file (usually a smallish sample is fine) to detect the features. Then it tallies up the feature weights for each encoding and chooses the encoding with the highest tally.
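As a rough illustration of the shape of that tally (with a made-up subset of features and made-up weights, not my real tables):

```cpp
#include <climits>

// The candidate encodings from the list above, plus a "binary" bucket.
enum Encoding { ENC_ASCII, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE,
                ENC_LATIN1, ENC_WIN1252, ENC_MACROMAN, ENC_IBM437,
                ENC_DECMCS, ENC_BINARY, ENC_COUNT };

// A handful of example features; the real list is longer.
enum Feature { FEAT_UTF16LE_BOM, FEAT_UTF8_BOM, FEAT_NUL_BYTES,
               FEAT_VALID_UTF8_SEQUENCES, FEAT_C1_CONTROL_BYTES, FEAT_COUNT };

// weight[f][e]: how strongly feature f argues for (positive) or against
// (negative) encoding e. Values here are purely illustrative.
static const int weight[FEAT_COUNT][ENC_COUNT] = {
    /* FEAT_UTF16LE_BOM          */ {  -5,  -5, 20,  -5,  -5,  -5,  -5,  -5,  -5, -2 },
    /* FEAT_UTF8_BOM             */ {  -5,  20, -5,  -5,  -5,  -5,  -5,  -5,  -5, -2 },
    /* FEAT_NUL_BYTES            */ { -10, -10,  5,   5, -10, -10, -10, -10, -10, 10 },
    /* FEAT_VALID_UTF8_SEQUENCES */ {   0,  10,  0,   0,  -3,  -3,  -3,  -3,  -3,  0 },
    /* FEAT_C1_CONTROL_BYTES     */ {  -2,  -2,  0,   0,  -5,   3,   0,   0,   0,  1 },
};

// Given which features were detected in the sample, pick the encoding
// with the highest total weight.
Encoding GuessEncoding(const bool detected[FEAT_COUNT])
{
    Encoding best = ENC_ASCII;
    int bestScore = INT_MIN;
    for (int e = 0; e < ENC_COUNT; ++e) {
        int score = 0;
        for (int f = 0; f < FEAT_COUNT; ++f)
            if (detected[f])
                score += weight[f][e];
        if (score > bestScore) {
            bestScore = score;
            best = static_cast<Encoding>(e);
        }
    }
    return best;
}
```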
Scanning is very cheap. A custom state machine examines the first few bytes for the various BOMs, and then it creates a histogram of the various byte values. Nearly all of the features can be detected by looking at subsets of the histogram.
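The histogram part is as simple as it sounds; in sketch form (the BOM state machine is omitted here, since a BOM check was already shown earlier in this thread):

```cpp
#include <cstddef>
#include <cstdint>

// One pass over a sample of the file: count how often each byte value
// occurs. Most of the features can then be read off slices of this table.
void BuildHistogram(const std::uint8_t *sample, std::size_t length,
                    std::size_t counts[256])
{
    for (std::size_t i = 0; i < 256; ++i)
        counts[i] = 0;
    for (std::size_t i = 0; i < length; ++i)
        ++counts[sample[i]];
}
```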
For distinguishing UTF-8 from the others, the most useful features (other than a BOM) are (1) whether the number of leading bytes predicts the number of continuation bytes and (2) whether there are any byte values that are not legal in UTF-8.
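Something along these lines, using a histogram like the one in the previous sketch (a simplified illustration, not the exact code I use):

```cpp
#include <cstddef>

// Two UTF-8 features computed from the byte-value histogram.
// Feature 1: each lead byte predicts a number of continuation bytes
// (0x80-0xBF); in well-formed UTF-8 the predicted and actual counts match.
// Feature 2: the values 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8.
bool LooksLikeUtf8(const std::size_t counts[256])
{
    std::size_t continuation = 0, predicted = 0, illegal = 0;

    for (int b = 0x80; b <= 0xBF; ++b) continuation += counts[b];      // continuation bytes
    for (int b = 0xC2; b <= 0xDF; ++b) predicted += 1 * counts[b];     // 2-byte lead bytes
    for (int b = 0xE0; b <= 0xEF; ++b) predicted += 2 * counts[b];     // 3-byte lead bytes
    for (int b = 0xF0; b <= 0xF4; ++b) predicted += 3 * counts[b];     // 4-byte lead bytes

    illegal += counts[0xC0] + counts[0xC1];                            // overlong lead bytes
    for (int b = 0xF5; b <= 0xFF; ++b) illegal += counts[b];           // out-of-range bytes

    return illegal == 0 && continuation == predicted;
}
```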
Most of the remaining features exist to tease apart sometimes subtle differences, because there's a lot of overlap among encodings like ISO 8859-1, Windows-1252, and DEC MCS. I'm not sure how well this technique would hold up if you were trying to identify specific code pages from a much broader set.