
I need to detect a string's encoding, but mb_detect_encoding isn't working.

I obtain the string from a file (file_get_contents), and I know the file that was giving me trouble is in UTF-16 LE. However, from what I understand of the docs, detecting this encoding is not possible (mb_detect_order: "For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.").

How can I reliably determine a string's encoding in PHP, for any possible encoding?

I lost multiple hours trying to solve this and found no good resource. I would like to automate this so that if the file's encoding changes, my program can still handle it (I am obtaining the file from another website).


I've tried this with no success; it tells me UTF-8:

mb_detect_encoding($proper_string, 'UTF-16LE,UCS-2,UTF-8,ASCII', true)

I've also tried this:

echo 'mb_check_encoding($fileContents, \'UTF-8\'): ' . mb_check_encoding($fileContents, 'UTF-8') . "\n";
//true
echo 'mb_check_encoding($fileContents, \'UTF-16\'): ' . mb_check_encoding($fileContents, 'UTF-16') . "\n";
//true
echo 'mb_check_encoding($fileContents, \'UTF-16LE\'): ' . mb_check_encoding($fileContents, 'UTF-16LE') . "\n";
//true
echo 'mb_check_encoding($fileContents, \'UCS-2\'): ' . mb_check_encoding($fileContents, 'UCS-2') . "\n";
//true
echo 'mb_check_encoding($fileContents, \'ISO-8859-1\'): ' . mb_check_encoding($fileContents, 'ISO-8859-1') . "\n";
//true
loco.loop
  • Detecting a character encoding is not trustworthy at all. It's a process of eliminating possibilities from a set of encodings under consideration and then picking one. If more than one is possible (and there almost always is), or your starting set does not include the right one, you could be wrong. For example, if you consider CP437, it is always possible. You should read a text file or stream with the encoding you know it to be. If you don't know, then you have a failed communication. A web server might already be telling you what the encoding is through the HTTP Content-Type header. – Tom Blodget Oct 08 '16 at 17:23
  • Hadn't thought about http headers, I will look into that. It might solve my problem of not knowing. Thanks. – loco.loop Oct 08 '16 at 17:26
  • That was helpful, I now know that the file is UTF-16. My problem now is that when I open it with Sublime I see UTF-16 LE; how can I detect this? – loco.loop Oct 08 '16 at 17:49
  • If the first codepoint is U+FEFF, that's a Byte Order Mark (and not part of the text). See 3.1 and 3.2 in [RFC 2781](https://tools.ietf.org/html/rfc2781#section-3). If there isn't one, the byte order should have been identified along with UTF-16. (If the byte order is specified, there should not be a BOM.) If neither is the case, you'd be back to heuristics, but only to choose between two possibilities. – Tom Blodget Oct 08 '16 at 20:55
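
The approach suggested in the comments (trust the declared charset, and check for a BOM to settle the byte order) could be sketched roughly as follows. This is only a minimal illustration, not a general encoding detector: the function names charset_from_content_type, detect_bom and to_utf8 are hypothetical helpers invented here, and falling back to UTF-8 when there is neither a BOM nor a declared charset is an assumption that may not suit your data.

```php
<?php
// Hypothetical helpers for illustration; none of these are built into PHP or mbstring.

// Extracts the charset parameter from a Content-Type header value, or null if absent.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Returns [encoding, BOM length in bytes], or [null, 0] if no known BOM is present.
function detect_bom(string $bytes): array
{
    // Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
    // A UTF-16LE text whose first character is U+0000 would be misread as UTF-32LE.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (substr($bytes, 0, strlen($bom)) === $bom) {
            return [$encoding, strlen($bom)];
        }
    }
    return [null, 0];
}

// Converts to UTF-8: a BOM wins, then the declared charset, then (assumption) UTF-8.
function to_utf8(string $bytes, ?string $declared = null): string
{
    [$encoding, $bomLength] = detect_bom($bytes);
    if ($encoding === null) {
        $encoding = $declared ?? 'UTF-8';
    }
    // Strip the BOM before converting, since it is not part of the text.
    return mb_convert_encoding(substr($bytes, $bomLength), 'UTF-8', $encoding);
}

// Stand-in for file_get_contents(...): "hi" in UTF-16LE with a BOM.
$fileContents = "\xFF\xFE" . "h\x00i\x00";
echo to_utf8($fileContents), "\n"; // prints "hi"
```

If the bytes carry no BOM, the second argument would be where a charset taken from the server's Content-Type header (for example via charset_from_content_type) goes. When neither source is available, any result is a guess, as the first comment points out.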
