Detect charset of file dynamically in c++

Question

I am trying to read a file which may have any charset/code page, but I don't know which locale to set in order to read the file correctly.

Below is my code snippet in which I am trying to read a file having charset as windows-1256, but I want to get the charset dynamically from the file being read so that I can set the locale accordingly.

std::wifstream input{ filename.c_str() };
input.imbue(std::locale(".1256")); // imbue before reading, otherwise the conversion is not applied
std::wstring content{ std::istreambuf_iterator<wchar_t>(input), std::istreambuf_iterator<wchar_t>() };
contents = ws2s(content); // Convert wstring to CString

score 3 · Accepted Answer · answered May 11 '17 at 13:20

In general, this is impossible to do accurately using the content of a plain text file alone. Usually you should rely on some external information. For example, if the file was downloaded with HTTP, the encoding should be received within a response header.

Some files may contain information about the encoding as specified by the file format. XML for example: <?xml version="1.0" encoding="XXX"?>.
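As a rough illustration of reading a format-level declaration, the sketch below pulls the encoding name out of an XML declaration at the start of a buffer. The function name xml_declared_encoding is made up for this example, and it assumes the file is in an ASCII-compatible encoding so the declaration itself can be read as plain bytes. It is not a conforming XML parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding named in a leading XML
// declaration, or an empty string if there is none.
// Assumes the declaration is readable as ASCII-compatible bytes.
std::string xml_declared_encoding(const std::string& head) {
    // The declaration, if present, must be the very first thing in the file.
    if (head.compare(0, 5, "<?xml") != 0) return "";
    const std::size_t end = head.find("?>");
    if (end == std::string::npos) return "";

    // Only search inside the declaration itself.
    const std::string decl = head.substr(0, end);
    const std::size_t key = decl.find("encoding");
    if (key == std::string::npos) return "";

    // The value is quoted with either " or '; find the matching pair.
    const std::size_t open = decl.find_first_of("\"'", key);
    if (open == std::string::npos) return "";
    const std::size_t close = decl.find(decl[open], open + 1);
    if (close == std::string::npos) return "";

    return decl.substr(open + 1, close - open - 1);
}
```

An empty result means only that nothing was declared, so the caller still needs a fallback.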

Unicode encodings can be detected if the file starts with a Byte Order Mark - which is optional.
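A minimal sketch of the BOM check, assuming the first few bytes of the file have already been read into a std::string. The names Bom and detect_bom are invented for this example. Note that the bytes FF FE 00 00 are ambiguous (UTF-32 LE, or UTF-16 LE text starting with a NUL character); the sketch treats them as UTF-32 LE.

```cpp
#include <cstddef>
#include <string>

// Encodings that a leading byte order mark can identify (hypothetical enum).
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Returns true if `data` begins with the `n` bytes in `sig`.
static bool starts_with(const std::string& data, const char* sig, std::size_t n) {
    return data.size() >= n && data.compare(0, n, sig, n) == 0;
}

// Inspect the first bytes of a buffer for a BOM.
// The 4-byte UTF-32 marks are tested first because FF FE 00 00
// also begins with the UTF-16 LE mark FF FE.
Bom detect_bom(const std::string& data) {
    if (starts_with(data, "\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with(data, "\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with(data, "\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (starts_with(data, "\xFE\xFF", 2))         return Bom::Utf16BE;
    if (starts_with(data, "\xFF\xFE", 2))         return Bom::Utf16LE;
    return Bom::None;  // no BOM: the file may still be Unicode
}
```

Since the BOM is optional, Bom::None says nothing about the encoding; it only means this particular check was inconclusive.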

You can usually assume that the encoding uses a wide character if the file contains a zero byte - which would represent the string terminator as a narrow character - before the end of the file. Likewise if you find two consecutive zeroes aligned to a 2 byte boundary (before the end), then the encoding is probably 4 bytes wide.
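The zero-byte heuristic could be sketched roughly as follows. WidthGuess and guess_code_unit_width are names made up for this example, and the result is only a guess: binary files or text with embedded NULs will fool it, and for simplicity it does not treat a zero in the very last byte specially.

```cpp
#include <cstddef>
#include <string>

// Guessed size of one code unit (hypothetical enum for this sketch).
enum class WidthGuess { Narrow, TwoByte, FourByte };

// Scan the raw bytes of a file for zero bytes.
// - no zero byte at all            -> probably a narrow encoding
// - a zero byte somewhere          -> probably a wide encoding (2 bytes)
// - two zeroes at an even offset   -> probably 4 bytes per code unit
WidthGuess guess_code_unit_width(const std::string& data) {
    bool has_zero = false;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] != '\0') continue;
        has_zero = true;
        // Two consecutive zero bytes aligned to a 2-byte boundary.
        if (i % 2 == 0 && i + 1 < data.size() && data[i + 1] == '\0')
            return WidthGuess::FourByte;
    }
    return has_zero ? WidthGuess::TwoByte : WidthGuess::Narrow;
}
```

This only works well for text that is mostly in the ASCII/Latin range, where the high bytes of each wide character are zero.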

Other than that, you could try to guess the encoding based on the frequency of certain characters. This can have some unintended consequences.

There is no foolproof way of determining the charset, but we can use the ICU library for this, which gives a heuristic-based solution. I used a snippet from https://github.com/mooz/node-icu-charset-detector/blob/master/node-icu-charset-detector.cpp - Saurabh Kathpalia, May 29 '17 at 15:53
@SaurabhKathpalia yeah, outsourcing to a library is a great way to save time and effort. Just keep in mind the potential pitfalls of the heuristic-based approach (the last link of my answer is a practical example of such a pitfall). - eerorika, May 29 '17 at 16:11

score 2 · Answer 2 · answered May 11 '17 at 13:19

Let me be blunt and say: you can't.

Let me qualify that: a file is simply tons of 0's and 1's stuck on your disk. A charset is a way to interpret these 0's and 1's. You have to provide the information on how to interpret them, namely, by specifying a charset.

A typical way of doing that is by writing a header to specify the charset.

This is an HTML header:

<head>
  <title>Page Title</title>
  <meta charset="UTF-8">
</head>

As you can see, the charset must be specified one way or another.

Once in a while, you do see some rogue application guessing a charset. They often do so with heuristics on the distribution of bytes, but that is not reliable and often results in gibberish.

As a side note, try to use UTF-8 everywhere. The others are, to put it lightly, messy.
