I am trying to read a file that may use any charset/code page, but I don't know which locale to set in order to read the file correctly.

Below is my code snippet, which reads a file whose charset is windows-1256. I want to detect the charset dynamically from the file being read so that I can set the locale accordingly.

std::wifstream input{ filename.c_str() };
// Imbue the locale before reading; imbuing afterwards does not affect data already read.
input.imbue(std::locale(".1256"));
std::wstring content{ std::istreambuf_iterator<wchar_t>(input), std::istreambuf_iterator<wchar_t>() };
contents = ws2s(content); // Convert wstring to CString
Saurabh Kathpalia

2 Answers

In general, this is impossible to do accurately using the content of a plain text file alone. Usually you should rely on some external information. For example, if the file was downloaded over HTTP, the encoding should be given in a response header (the charset parameter of Content-Type).

Some files may contain information about the encoding as specified by the file format. XML for example: <?xml version="1.0" encoding="XXX"?>.
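
As a rough illustration of reading such a declaration, here is a minimal sketch. The function name xml_declared_encoding is hypothetical, and the sketch assumes the first bytes of the file are ASCII-compatible, so it cannot read a declaration stored as UTF-16 or UTF-32. It is not a full XML parser.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the value of the encoding attribute in a
// leading XML declaration, or an empty string if none is found.
// Assumption: `head` holds the first bytes of the file, read in binary mode,
// and those bytes are ASCII-compatible.
std::string xml_declared_encoding(const std::string& head) {
    static const std::regex decl(
        R"re(^<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])re");
    std::smatch m;
    return std::regex_search(head, m, decl) ? m[1].str() : std::string();
}
```

An empty result only means the file does not declare an encoding this way. It says nothing about what the encoding actually is.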

Unicode encodings can be detected if the file starts with a Byte Order Mark - which is optional.
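
A minimal sketch of such a BOM check follows. The helper name detect_bom and the returned labels are my own choices, and the caller is assumed to have read the first few bytes of the file in binary mode. Note that FF FE 00 00 is ambiguous: it could also be UTF-16LE text that starts with a NUL character, and this sketch simply prefers UTF-32LE.

```cpp
#include <string>

// Hypothetical helper: returns the encoding named by a leading
// Byte Order Mark, or an empty string if the data has no BOM.
std::string detect_bom(const std::string& bytes) {
    auto starts_with = [&](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    // Check the 4-byte marks first: the UTF-32LE mark begins with the
    // same two bytes as the UTF-16LE mark.
    if (starts_with(std::string("\xFF\xFE\x00\x00", 4))) return "UTF-32LE";
    if (starts_with(std::string("\x00\x00\xFE\xFF", 4))) return "UTF-32BE";
    if (starts_with("\xEF\xBB\xBF")) return "UTF-8";
    if (starts_with("\xFF\xFE")) return "UTF-16LE";
    if (starts_with("\xFE\xFF")) return "UTF-16BE";
    return "";
}
```

Because the BOM is optional, an empty result does not rule out a Unicode encoding.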

You can usually assume that the encoding uses a wide character if the file contains a zero byte - which would represent the string terminator as a narrow character - before the end of the file. Likewise if you find two consecutive zeroes aligned to a 2 byte boundary (before the end), then the encoding is probably 4 bytes wide.
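
The zero-byte rule of thumb above could be sketched like this. The function name guess_code_unit_width is hypothetical, and the result is only a guess: binary files, or text that genuinely contains NUL characters, will mislead it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guesses the code unit width in bytes (1, 2 or 4)
// from where zero bytes appear in the raw file content.
int guess_code_unit_width(const std::string& bytes) {
    bool has_zero = false;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] != '\0') continue;
        has_zero = true;
        // Two consecutive zero bytes starting on a 2-byte boundary
        // suggest 4-byte code units.
        if (i % 2 == 0 && i + 1 < bytes.size() && bytes[i + 1] == '\0') return 4;
    }
    // Any other zero byte suggests 2-byte code units; none suggests 1-byte.
    return has_zero ? 2 : 1;
}
```

A result of 1 still leaves every narrow encoding open (windows-1256, UTF-8, and so on), so this only narrows the search.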

Other than that, you could try to guess the encoding based on the frequency of certain characters. This can have some unintended consequences.

eerorika
  • There is no foolproof way of determining the charset, but we can use the ICU library for this, which gives a heuristic-based solution. I used a snippet from https://github.com/mooz/node-icu-charset-detector/blob/master/node-icu-charset-detector.cpp – Saurabh Kathpalia May 29 '17 at 15:53
  • @SaurabhKathpalia Yeah, outsourcing to a library is a great way to save time and effort. Just keep in mind the potential pitfalls of the heuristic-based approach (the last link of my answer is a practical example of such a pitfall). – eerorika May 29 '17 at 16:11

Let me be blunt and say: you can't.

Let me qualify that: a file is simply tons of 0's and 1's stuck on your disk. A charset is a way to interpret these 0's and 1's. You have to provide the information on how to interpret them, namely, by specifying a charset.

A typical way of doing that is by writing a header to specify the charset.

This is an HTML header:

<head>
  <title>Page Title</title>
  <meta charset="UTF-8">
</head>

As you can see, the charset must be specified one way or another.

Once in a while, you do see some rogue application guessing a charset. They often do so with heuristics on the distribution of bytes, but that is not reliable and often results in gibberish.

As a side note, try to use UTF-8 everywhere; the others are, to put it lightly, messy.

Passer By