5

Problem Summary:

While trying to convert a site with mysql database from latin1 to utf8, some special characters are not displaying correctly despite ensuring charsets are all utf8 system wide.

Problem Details:

This is a common problem. But I seem to have an added complexity.

Years ago, a oblivious developer (me), put a site together with MySQL. Some tables were setup with latin1_swedish_ci and utf8_general_ci. All input/display was done via pages with iso-8859-1 charset.

Now, I have the task of turning all this data into utf-8 and thus finally uniforming the encoding. However, i'm having issues with a number of special characters in both instances (ie: ü). The characters don't seem to display correctly on a UTF-8 page. They display as �.Instead When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.

I've tried a number of things:

  1. Percona script: https://github.com/rlowe/mysql_convert_charset
  2. converting col to binary and then to utf8
  3. converting utf8 tables to latin and then repeat above process

Nothing seems to cure the data.

Dumping the entire database and important isn't really a viable option as it's a huge database now and downtime is restricted.

UPDATE (22-Oct-2013)

I've taken @deceze suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1. So, i've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following which displays the text correctly:

select convert(cast(convert(field1 using latin1) as binary) using utf8) from my table WHERE id = 1

This will convert Hahnemühle to Hahnemühle.

In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2 converts Hahnem�hle to Hahnem�hle. I've gone through all the charsets on http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html replacing latin1 but none seem to spit out the data correctly.

David
  • 16,246
  • 34
  • 103
  • 162
  • What exactly are those "problems" you're having? Are those characters stored correctly in their respective columns? Do they look as they should when viewing them in a decent administrative interface, or are they garbled already? Unless you're trying to store non-Latin1 characters in a Latin1 column, it doesn't really matter what setting these columns were in so far. A `latin1` column simply cannot store more than the 256 characters defined by Latin-1, other than that it doesn't matter in daily use. If you've inserted data using the correct connection encoding, there should be no problem. – deceze Oct 21 '13 at 14:23
  • @deceze I've added the problem (sorry). The characters aren't being displayed correctly on a utf-8 page. Instead I get a diamond question mark in Google Chrome. When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't. – David Oct 21 '13 at 14:32
  • Have you updated the connection encoding so data coming from the database is actually UTF-8 encoded? Or are you still retrieving data in Latin-1? – deceze Oct 21 '13 at 14:34
  • @deceze it was set as latin1_swedish_ci and i've changed it to utf8_general_ci as recommended in the Percona script. – David Oct 21 '13 at 14:36
  • 1
    Nope, not the *column collation*, the **connection encoding**. Please read [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/) for a rundown of the concept. – deceze Oct 21 '13 at 14:37
  • @deceze character_set_connection = utf8 and collation_connection = utf8_general_ci – David Oct 21 '13 at 14:46
  • Well, if the browser is displaying a �, that means it's trying to interpret whatever you're sending it as UTF-8, but you're not actually sending it UTF-8. If you set the connection encoding to the database to `utf8`, you should get UTF-8 encoded data, however it is stored in your columns. Somewhere along the chain database → PHP → browser the data is converted to something that's not UTF-8. Can't tell you where with the information given. – deceze Oct 21 '13 at 14:48
  • @deceze This is why I think it's something to do with the data. – David Oct 21 '13 at 14:55
  • It shouldn't. If you tell MySQL you want `utf8`, it'll give you valid UTF-8. The characters may be garbage because you've mistreated them in the past, but it should be valid UTF-8 encoded garbage. – deceze Oct 21 '13 at 14:57
  • @deceze If I spit out the problematic text on a php script and wrap it in htmlentities() it's all good. But use htmlsepcialchars() (or nothing) and it's no good. – David Oct 21 '13 at 15:18
  • Please show the `bin2hex($string)` value of the string in question and the text that *should* show. – deceze Oct 21 '13 at 15:27
  • @deceze Hahnemühle = 4861686e656dfc686c65 – David Oct 21 '13 at 15:46
  • Yup, that string is *Latin-1 encoded*! – deceze Oct 21 '13 at 15:47
  • wtf! Everything is utf8 now – David Oct 21 '13 at 15:56
  • Well, it's not. Somewhere your chain is broken. Make sure you understand everything in the aforelinked article and double check everything in your own chain. – deceze Oct 21 '13 at 17:01
  • @deceze Thanks for the advice. I've gone through the link and there were some inconsistencies which have been corrected now. However, I'm still having issues. Please see the update posted above. – David Oct 22 '13 at 10:20

3 Answers3

5

Setting a column to latin1 and others to utf8 is perfectly fine in MySQL. There's no problem to be solved here as such. This charset parameter just influences how the data is stored internally. Which of course also means that you cannot store, for example, "漢字" in a latin1 column. But assuming you're just storing "Latin-1 characters" in there, that's fine.

MySQL has something commonly called the connection encoding. It tells MySQL what encoding text is in that you send to it from PHP (or elsewhere), and what encoding you'd like back when retrieving data from MySQL. The column charset, the "input connection encoding" and "output connection encoding" can all be different things, MySQL will convert encodings on the fly accordingly as needed.

So, assuming you've used the correct connection encodings so far and data is stored properly in your database and you've not tried to store non-Latin-1 characters in Latin-1 columns, all you need to do to update your column charsets to UTF-8 is:

ALTER TABLE table MODIFY column TEXT [...] CHARACTER SET utf8;
deceze
  • 510,633
  • 85
  • 743
  • 889
5

You can try mysqldump to convert from ISO-8859-1 to utf-8:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql (or when you prefer  sed -i "" 's/latin1/utf8/g' dump.sql) 
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql
Micromega
  • 12,486
  • 7
  • 35
  • 72
3

You may get rid of the "glyph" characters (�) by applying utf8_encode to the string before displaying it in your page.

rob
  • 31
  • 1