
Is it possible, prior to converting a string from one charset to another, to know whether this conversion will be lossless?

If I try to convert a UTF-8 string to latin1, for example, the chars that can't be converted are replaced by ?. Checking for ? in the result string to find out whether the conversion was lossless is obviously not an option.

The only solution I can see right now is to convert back to the original charset, and compare to the original string:

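function canBeSafelyConverted($string, $fromEncoding, $toEncoding)
{
    // Round trip: convert to the target charset, then back to the source charset.
    $encoded = mb_convert_encoding($string, $toEncoding, $fromEncoding);
    $decoded = mb_convert_encoding($encoded, $fromEncoding, $toEncoding);

    // Strict comparison: with ==, PHP may compare numeric-looking strings as numbers.
    return $decoded === $string;
}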

This is only a quick-and-dirty solution, though, and it may behave unexpectedly at times. I guess there might be a cleaner way to do this with mbstring, iconv, or any other library.

PeeHaa
BenMorel
  • Have you tried checking the string sizes with `mb_strlen`? – adrian7 Aug 24 '12 at 22:18
  • That will fail: if a single char is converted to `?`, the lengths will be equal, at least with mbstring. Your idea would be interesting if it just dropped the non-convertible chars. – BenMorel Aug 24 '12 at 22:33
  • His example compares two strings, not the lengths of the strings. And the strings will not be equal when one of them contains a question mark where the other contains another character. – Clarence Aug 25 '12 at 10:53

1 Answer


An alternative is to install your own error handler with set_error_handler(). If you call iconv() on the string, it raises a notice when the string cannot be fully converted; you can catch that notice in the handler and react to it in your code.
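
A minimal sketch of the error-handler idea. It assumes an iconv implementation (such as glibc's) that raises a notice on unconvertible characters; other implementations may substitute or drop characters silently instead. The function name convertOrNull is made up for illustration.

```php
<?php
// Hypothetical helper: returns the converted string, or null if iconv()
// reported a problem (for example an unconvertible character).
function convertOrNull(string $string, string $fromEncoding, string $toEncoding): ?string
{
    // Turn any notice or warning raised during the call into an exception.
    set_error_handler(function (int $severity, string $message): bool {
        throw new ErrorException($message, 0, $severity);
    });

    try {
        $result = iconv($fromEncoding, $toEncoding, $string);
    } catch (ErrorException $e) {
        return null;
    } finally {
        // Always put the previous handler back, whatever happened.
        restore_error_handler();
    }

    // iconv() can also signal failure by returning false.
    return $result === false ? null : $result;
}
```

A caller would then treat a null result as "this conversion would be lossy" and fall back to whatever handling fits, without converting the string twice.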

Or you could just count the number of question marks before and after encoding. Or call iconv() with //IGNORE and count the number of characters.
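
The question-mark count could look roughly like this. It assumes a UTF-8 source, a single-byte target such as ISO-8859-1, and that mbstring's substitute character is left at its default ?; the function name is hypothetical.

```php
<?php
// Hypothetical helper: the conversion is treated as lossless when it does not
// introduce any question marks that were not already in the input.
function isLosslessByQuestionMarks(string $utf8, string $toEncoding = 'ISO-8859-1'): bool
{
    // mbstring replaces unconvertible characters with its substitute
    // character, which is "?" unless it has been changed.
    $converted = mb_convert_encoding($utf8, $toEncoding, 'UTF-8');

    // Byte-wise counting is safe here: "?" is a single byte in both UTF-8
    // and the single-byte target encoding.
    return substr_count($converted, '?') === substr_count($utf8, '?');
}
```

Because question marks that were already present are counted on both sides, a string such as "Is it café?" still passes, while one containing a euro sign fails for ISO-8859-1.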

None of these suggestions is much more elegant than yours, but they do get rid of the double processing.

Clarence
  • Interesting ideas, thanks for sharing them. I'm very surprised that this is not just part of the APIs! – BenMorel Aug 24 '12 at 22:35