Why do mbstring functions incorrectly identify ISO-8859 strings?

Question

Despite listing each ISO-8859 character set as an individual encoding, the mbstring functions treat every ISO-8859 character set interchangeably. To drive the point home:

$strings = [ 
  'English'   => 'Ea vim decore sapientem repudiandae. Sea cu delenit gamu mutn, tic.',
  'Cyrillic'  => 'Лорем ипсум долор сит амет, ин ехерци вереар номинати яуи, сит ин омниум инермис но.',
  'Greek'     => 'Λορεμ ιπσθμ δολορ σιτ αμετ, ηασ γραεcο νθσqθαμ cθ, εστ θτ εσσε διcαμ qθαλισqθε cθ.',
  'Armenian'  => 'լոռեմ իպսում դոլոռ սիթ ամեթ, եամ նո թաթիոն ծոմպռեհենսամ, իուս ադ նիսլ ոմնիս մինիմ եսթ',
  'Georgian'  => 'ლორემ იფსუმ დოლორ სით ამეთ, ეხ ყუანდო ცოფიოსაე უსუ, იუს ეუ ჰინც ვერო დომინგ ჰის',
  'Hindi'     => 'वर्ष एसेएवं व्याख्यान संदेश होने लक्षण एसेएवं पहोचाना विचरविमर्श? वर्णन करती आशाआपस अन्तरराष्ट्रीयकरन. रहारुप कार्यसिधान्त',
  'Korean'    => '모든 국민은 보건에 관하여 국가의 보호를 받는다, 전직대통령의 신분과 예우에 관하여는 법',
  'Arabic'    => 'مع لهذه الهجوم عدم, فكان اتفاق الصفحات من أسر. وجزر عُقر أما بـ, عل دار بقسوة المتّبعة بالولايات. وإقامة والفرنسي كل لكل. أي',
  'Hebrew'    => 'עמוד מדינות, חפש ואלקטרוניקה אנתרופולוגיה דת, מה קהילה הקהילה טכנו'
];

$encodings = ['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15' ];

foreach( $strings as $lang => $text ) {
    echo $lang . " is encoded as " . mb_detect_encoding( $text, $encodings ) . "\n";

    foreach( $encodings as $encoding ) {
        echo " - is " . (mb_check_encoding( $text, $encoding ) ? "" : "not ") . $encoding . "\n";
    }
}

This produces output to the effect of

Hindi is encoded as ISO-8859-1
  - is ISO-8859-1
  - is ISO-8859-2
  - is ISO-8859-3
  - is ISO-8859-4
  - is ISO-8859-5
  - is ISO-8859-6
  - is ISO-8859-7
  - is ISO-8859-8
  - is ISO-8859-9
  - is ISO-8859-10
  - is ISO-8859-13
  - is ISO-8859-14
  - is ISO-8859-15

with identical results for every listed language, which is clearly not true.

Why does mbstring list every ISO-8859 encoding separately but treat them interchangeably? Is there any way to reliably detect the proper spec?

Or am I simply misusing these functions?

Use `echo $lang . " is encoded as " . mb_detect_encoding( $text ) . "\n";` instead. See the difference — RiggsFolly, Mar 25 '17 at 11:00
@RiggsFolly I certainly see the difference - I was just hoping that the mbstring functions might differentiate between the different `ISO-8859` encodings. It seems like the module should just group them all together as `ISO-8859` instead of listing them separately when it doesn't actually have the necessary logic to differentiate between them. — bosco, Mar 25 '17 at 11:04
@PaulCrovella honestly this has been my first brush messing around with the mbstring functions - I hadn't realized their limitations, nor investigated their actual implementation. I was trying to work out a way to [sort strings for different alphabets](http://wordpress.stackexchange.com/questions/261038/limit-search-to-latin-characters/261324#261324) without resorting to manually assembled regex. — bosco, Mar 25 '17 at 11:15
@PaulCrovella you just blew my mind - I had no idea you could match entire Unicode scripts with regular expressions! — bosco, Mar 25 '17 at 11:22
@PaulCrovella I will most certainly do that - it's long past time that I get up to speed with internationalization, anyway. Thanks for enlightening me! If you care to throw the information regarding mbstring functions only checking byte sequences into an answer, it would definitely fulfill my question as it stands. Thanks again — bosco, Mar 25 '17 at 11:52
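
For readers following the thread: the script-matching idea mentioned in these comments can be sketched roughly as below. This is only an illustration, assuming UTF-8 input and PCRE's Unicode script properties (`\p{Cyrillic}` and friends, with the `/u` modifier). The helper name `scripts_in` and the list of scripts are made up for the example and are not taken from the original discussion.

```php
<?php
// Hypothetical helper: report which of a few Unicode scripts occur in a UTF-8 string.
function scripts_in(string $text): array
{
    // Script names as understood by PCRE's \p{...} property syntax.
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Armenian', 'Georgian',
                'Devanagari', 'Hangul', 'Arabic', 'Hebrew'];

    $found = [];
    foreach ($scripts as $script) {
        // The /u modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match('/\p{' . $script . '}/u', $text) === 1) {
            $found[] = $script;
        }
    }
    return $found;
}

echo implode(', ', scripts_in('Лорем ипсум долор сит амет')), "\n";
echo implode(', ', scripts_in('Lorem ипсум')), "\n";
```

Spaces and punctuation belong to the "Common" script, so they do not show up in the result. Mixed text simply reports every script that appears.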

score 3 · Accepted Answer · answered Mar 25 '17 at 12:08

mb_detect_encoding only makes a guess as to what the encoding might be. It is not possible for this sort of thing to be accurate (and this function doesn't do much to try).

mb_check_encoding tells you whether a string consists of a byte sequence that is valid for the given encoding. Given that every possible byte is valid in each ISO-8859-*, it's pointless to validate against them (these checks will always return true).
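
A minimal sketch of that point, assuming the mbstring extension is loaded. The variable names are only for illustration, and ISO-8859-1 and ISO-8859-5 are picked as two representative single-byte encodings. The same bytes pass validation for both, yet decode to different characters, whereas UTF-8 has structural rules that a byte sequence can actually violate.

```php
<?php
// A UTF-8 encoded Devanagari word, and a single byte that is not valid UTF-8 on its own.
$hindi = 'वर्ष';
$byte  = "\xE9";

// Validation against single-byte encodings passes: any byte sequence "fits".
$hindiIsLatin1    = mb_check_encoding($hindi, 'ISO-8859-1');
$hindiIsCyrillic  = mb_check_encoding($hindi, 'ISO-8859-5');
$byteIsLatin1     = mb_check_encoding($byte, 'ISO-8859-1');
$byteIsCyrillic   = mb_check_encoding($byte, 'ISO-8859-5');

// The same byte nevertheless means a different character in each encoding.
$asLatin1   = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$asCyrillic = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-5');
$sameCharacter = ($asLatin1 === $asCyrillic);

// UTF-8 is different: a lone 0xE9 byte is an incomplete sequence.
$byteIsUtf8  = mb_check_encoding($byte, 'UTF-8');
$hindiIsUtf8 = mb_check_encoding($hindi, 'UTF-8');

var_dump($hindiIsLatin1, $hindiIsCyrillic, $byteIsLatin1, $byteIsCyrillic);
var_dump($sameCharacter, $byteIsUtf8, $hindiIsUtf8);
```

In other words, a passing check against an ISO-8859 encoding says nothing about which alphabet the text was written in; only a check against an encoding with real structure, such as UTF-8, can fail.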

For related reading I very much recommend Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

Tragically, in the course of my brief research regarding the mbstring functions, that article appeared in my search results a few times but I never visited it. Mistakes were made O.o — bosco, Mar 25 '17 at 12:12
