First of all, I should say that I am new to multilingual encoding conversions.

I have strings that I want to lowercase as UTF-8 if possible (something like a clean URL), and I use

$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);

to achieve my requirement (a lowercase UTF-8 string).

However, when I test that code with "çokGüŞelLl" using CocoaRestClient, I get à as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in the input string (Ã).

What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?

Note: I have read the @iconv questions here, but I do not think that is a good solution, because it would leave empty database entries.


Thanks for all the answers; I will read and try to understand each of them.

Hilmi Erdem KEREN
  • Your input is not UTF-8. If you really used `utf8_encode()` to create it, it's possible that your original text was not ISO-8859-1. – Álvaro González Feb 11 '14 at 13:14
  • My input ÇokGüŞelLl is UTF-8 (also saved as UTF-8 general in MySQL), and the returned result is the same. However, I don't know about that à thing. I use exactly this code in my real code too. – Hilmi Erdem KEREN Feb 11 '14 at 13:16
  • If you store data as UTF-8 and you need data as UTF-8, why do you convert from ISO-8859-1? You can use [bin2hex](http://es1.php.net/bin2hex) to know what your actual bytes are. – Álvaro González Feb 11 '14 at 13:18
  • A hint: $mysqli->set_charset("utf8"); otherwise you will probably get your database output as ISO-8859-1, even though your column says UTF-8. – Lorenz Feb 11 '14 at 13:20
  • Thanks @Aragon0 I am already doing that. – Hilmi Erdem KEREN Feb 11 '14 at 13:21
  • I don't get it, @ÁlvaroG.Vicario. I am trying to save user-generated data to my database table so that I can use it for CI purposes later. Because our users are worldwide, I cannot guess what their keyboards let them write. That is why I use iconv. – Hilmi Erdem KEREN Feb 11 '14 at 13:24
  • Their keyboard doesn't *write* anything, the browser submits it in an encoding and all modern browsers default to utf-8 at that point, unless you change it with the `accept-charset` attribute on your form. – Fleshgrinder Feb 11 '14 at 13:25
  • Drupal for instance always sets `accept-charset` to `UTF-8`, you can do that as well if you have users who use a totally broken client. – Fleshgrinder Feb 11 '14 at 13:37
  • I see. I just realised that my "/(?:^|\s)\pL+/" match puts my 'çokGüŞelLl' string into $matches as "Ã". It is not CocoaRestClient. That is probably why I get so many different errors from iconv. – Hilmi Erdem KEREN Feb 11 '14 at 13:46

3 Answers

The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn't, every byte is still treated as a separate ISO-8859-1 character and re-encoded, so you get garbled results.
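To make the effect concrete, here is a minimal sketch of what happens when a string that is already UTF-8 goes through that conversion a second time. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode(); the variable names are only illustrative, and the mbstring extension is assumed to be available.

```php
<?php
// "ç" encoded as UTF-8 is the two bytes C3 A7.
$utf8 = "\xC3\xA7";

// Same conversion as utf8_encode($utf8): every input byte is read
// as one ISO-8859-1 character and written out again as UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a7
echo bin2hex($double), "\n"; // c383c2a7, which displays as "Ã§"
```

The byte C3 on its own becomes "Ã", which is the character reported in the question.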

Ensure that your data is proper UTF-8 before saving it to your database:

// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}

// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);

Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.

$string = $database->getSomeRecordWithUnicode();

echo mb_strtolower($string);

Done!

PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.

PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.

About HTML forms

Since you asked in the comments on your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It depends on the client's browser. All modern browsers default to utf-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept utf-8. Drupal does that on all its forms.

<!doctype html>
<html>
<body>
    <form accept-charset="UTF-8">
        <!-- form fields go here -->
    </form>
</body>
</html>

Now all browsers should encode the data they submit in utf-8.

Fleshgrinder

If you encode çokGüŞelLl as UTF-8 you should get the following bytes:

var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"

That's a check you must do. You also have this:

utf8_encode($str)

Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.

So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
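One way to check whether a given character survives a trip through ISO-8859-1 is a round-trip comparison. This is only a sketch: the helper name survivesLatin1() is hypothetical, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: true if the UTF-8 string can be converted to
// ISO-8859-1 and back without changing.
function survivesLatin1(string $utf8): bool
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

    return $back === $utf8;
}

var_dump(survivesLatin1("\xC3\xA7")); // "ç" exists in ISO-8859-1: bool(true)
var_dump(survivesLatin1("\xC5\x9E")); // "Ş" does not: bool(false)
```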

Álvaro González
  • It is preg_match_all(). Changing regex from '/(?:^|\s)\#\pL+/' to '/(*UTF8)(?:^|\s)\#\pL+/' fixed my issue. Thanks for illuminating my way. – Hilmi Erdem KEREN Feb 11 '14 at 14:18
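The difference between byte-wise and UTF-8 matching can be reproduced with a short sketch. It uses PHP's `u` pattern modifier rather than the `(*UTF8)` prefix from the comment above; both switch PCRE into UTF-8 mode, and `u` additionally makes PHP validate the subject string. The pattern is simplified to `\pL+` for illustration, and the variable names are assumptions.

```php
<?php
// "çokGüŞelLl" written out as UTF-8 bytes, so the example does not
// depend on the encoding of this source file.
$input = "\xC3\xA7okG\xC3\xBC\xC5\x9EelLl";

// Without /u the subject is matched byte by byte, so a multi-byte
// letter can be cut in the middle.
preg_match('/\pL+/', $input, $byteMode);

// With /u the subject is decoded as UTF-8 before \pL is applied.
preg_match('/\pL+/u', $input, $utf8Mode);

echo bin2hex($byteMode[0]), "\n"; // hex of the byte-wise match
echo $utf8Mode[0] === $input ? "full match\n" : "partial match\n"; // full match
```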
You're double encoding. First you set your database to UTF-8, which means your data is already UTF-8 encoded. Then you call utf8_encode on the input to iconv, but that input is already UTF-8. Try removing the utf8_encode call from the iconv statement.
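A minimal sketch of the question's pipeline without utf8_encode might look like the following. The function name toSlug() is hypothetical, lowercasing is moved before the transliteration so that mb_strtolower() sees the original letters, and the result of ASCII//TRANSLIT for non-ASCII letters depends on the iconv implementation and the current locale, so no output is claimed for those.

```php
<?php
// Hypothetical helper: lowercase, transliterate to ASCII, strip the rest.
function toSlug(string $str): string
{
    // Reject input that is not valid UTF-8 instead of guessing.
    if (!mb_check_encoding($str, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $str = mb_strtolower($str, 'UTF-8');

    // No utf8_encode() here: the input is already UTF-8.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    if ($ascii === false) {
        throw new RuntimeException('iconv could not transliterate the input.');
    }

    return preg_replace('/[^a-z0-9_]/', '', $ascii);
}

echo toSlug('Hello_World 42'), "\n"; // hello_world42
```

Throwing on invalid input, rather than silencing iconv with @, avoids storing empty strings without hiding the underlying encoding problem.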

Lorenz