What is the preferable character encoding order for "mb_detect_encoding( )" in PHP?

Question

What is the preferable character encoding order to pass as the second argument of mb_detect_encoding()?

I am asking because some character encodings overlap: ASCII is returned for UTF-8 (in some cases) and EUC-CN for gb2312, and for a simplified Chinese, EUC-CN-compatible string, whichever of EUC-CN, EUC-JP, EUC-KR, EUC-TW appears earliest in the sequence passed to the function is returned.

Here are some that I collected, but I want to make the list as comprehensive as possible.

EUC-CN
EUC-JP
EUC-KR
EUC-TW
SJIS
ASCII
JIS
UTF-8
EUC-JP
EUC-CN
EUC-KR
EUC-TW
SJIS

Kindly help me to correct the order and make this list as large as possible.

Edit 1:

All I want to do with this is convert any string to UTF-8.

Edit 2:

Considering the suggestions below, I want to minimize the chance of text being corrupted or lost in encoding conversion, because the converted text is the only thing my site relies on. So, even if the solution I am using is not perfect, would you please demonstrate the most reliable one?

Well, yes, many encodings are indistinguishable from each other. That's why accurate automatic detection is generally impossible. This function can be useful for a small number of candidates, but is useless for a large number of them, as you are seeing. You'd need a statistical text analyzer on top of that to increase the accuracy, which is not built into PHP. You should not have to use encoding detection to begin with. – deceze, Aug 02 '12 at 17:49
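
To illustrate the "small number of candidates" point, here is a minimal sketch, assuming the input can only ever be in one of a short, known list of encodings. The name `toUtf8` and its default candidate list are hypothetical, chosen for illustration only. It validates with `mb_check_encoding()` in priority order instead of calling `mb_detect_encoding()`, so the outcome is deterministic, and it fails loudly rather than guessing.

```php
<?php
// Hypothetical helper (not part of mbstring): convert $bytes to UTF-8 using the
// first candidate encoding in which the bytes form a valid sequence.
function toUtf8(string $bytes, array $candidates = ['UTF-8', 'EUC-JP']): string
{
    foreach ($candidates as $enc) {
        if (mb_check_encoding($bytes, $enc)) {
            return mb_convert_encoding($bytes, 'UTF-8', $enc);
        }
    }
    // No candidate fits: refuse to guess so that no text is silently mangled.
    throw new UnexpectedValueException('Input matches none of the candidate encodings.');
}
```

Putting UTF-8 first is a deliberate choice: non-ASCII text in a legacy encoding is rarely also valid UTF-8, so a strict UTF-8 check is a cheap first filter. The order of the remaining candidates is still only a policy, not a detection.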

score 2 · Answer 1 · answered Aug 02 '12 at 17:49


There is no true preferred order that gives you the most accurate response.

There will always be strings that are valid in several character sets, and mb_detect_encoding cannot determine which of them is the correct one.
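
A small sketch of that ambiguity, assuming the mbstring extension is loaded and PHP 7.4+ (for the arrow function). `validEncodings` is a hypothetical helper name; it only reports every candidate in which the same bytes form a valid sequence.

```php
<?php
// Hypothetical helper: list every candidate encoding in which $bytes is valid.
function validEncodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn (string $enc): bool => mb_check_encoding($bytes, $enc)
    ));
}

// "你好" encoded as EUC-CN (GB2312).
$bytes = "\xC4\xE3\xBA\xC3";
print_r(validEncodings($bytes, ['ASCII', 'UTF-8', 'EUC-CN', 'EUC-JP', 'EUC-KR', 'ISO-8859-1']));
```

ASCII and UTF-8 are ruled out, but more than one of the remaining candidates accepts the bytes (ISO-8859-1 accepts any byte sequence at all), so validity alone cannot pick a winner.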

The only ways to solve this are to:

1. Have a human who understands the language select the correct encoding.
2. Potentially analyze the actual text in your string, and 'guess' which encoding is most likely to be correct.

For number two I don't know of a ready-made option, but I can imagine things like character occurrence rates, Bayesian filters, neural networks and dictionary checks could be useful ;)
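
As a very rough sketch of the 'guess' idea (far simpler than a Bayesian filter): decode the bytes under each candidate and count how many characters fall in the script you expect. Everything here is an assumption for illustration: the helper name `guessByScript`, the requirement that you already know the expected script, and the scoring rule itself.

```php
<?php
// Hypothetical helper: among the candidates in which $bytes is valid, pick the one
// whose decoded text contains the most characters matching the $expected pattern.
function guessByScript(string $bytes, array $candidates, string $expected): ?string
{
    $best = null;
    $bestScore = 0;
    foreach ($candidates as $enc) {
        if (!mb_check_encoding($bytes, $enc)) {
            continue; // not even a valid byte sequence in this encoding
        }
        $text = mb_convert_encoding($bytes, 'UTF-8', $enc);
        $score = (int) preg_match_all($expected, $text);
        if ($score > $bestScore) {
            $best = $enc;
            $bestScore = $score;
        }
    }
    return $best; // null when no candidate produced any expected character
}

// Assumes this source file is saved as UTF-8.
$bytes = mb_convert_encoding('안녕하세요', 'EUC-KR', 'UTF-8');
var_dump(guessByScript($bytes, ['EUC-CN', 'EUC-JP', 'EUC-KR'], '/\p{Hangul}/u'));
```

This works for the Korean sample because the Chinese and Japanese character sets contain no Hangul. It is not a general solution: ties are possible (a kana-only string, for example, can score the same under EUC-CN and EUC-JP), and on a tie the earlier candidate wins.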

answered Aug 02 '12 at 17:49

Evert



Can I add: if you don't know the encoding of your text, there is a problem somewhere else. – InternetSeriousBusiness Aug 02 '12 at 17:51
@InternetSeriousBusiness PEBKAC? – Event_Horizon Aug 02 '12 at 18:12
Often, the document will include encoding information, e.g. `$ curl -I http://stackoverflow.com --silent | fgrep charset` `Content-Type: text/html; charset=utf-8` see http://en.wikipedia.org/wiki/Character_encodings_in_HTML – Frank Farmer Aug 02 '12 at 18:23
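
Following up on the declared-charset point, a minimal sketch that trusts the `Content-Type` header instead of detecting anything. Both function names are hypothetical, and the `ValueError` handling assumes PHP 8, where mbstring throws on an encoding label it does not recognize.

```php
<?php
// Hypothetical helper: pull the charset label out of a Content-Type header value.
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $header, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert using the declared charset and refuse to guess.
function declaredToUtf8(string $bytes, string $contentType): string
{
    $charset = charsetFromContentType($contentType);
    if ($charset === null) {
        throw new RuntimeException('No charset declared; refusing to guess.');
    }
    try {
        if (!mb_check_encoding($bytes, $charset)) {
            throw new RuntimeException("Bytes are not valid $charset.");
        }
        return mb_convert_encoding($bytes, 'UTF-8', $charset);
    } catch (ValueError $e) {
        // PHP 8: mbstring rejects encoding labels it does not know.
        throw new RuntimeException("mbstring does not know \"$charset\".", 0, $e);
    }
}
```

For example, `declaredToUtf8($body, 'text/html; charset=ISO-8859-1')` converts the body from ISO-8859-1. A declared charset can still be wrong, which is why the bytes are validated before conversion.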
