
I have a PHP application with a MySQL database that "should" contain UTF8 encoded data. With regard to Unicode characters, my application appears to work properly from beginning to end. If someone submits "Strömgren" into my database (via an HTML form), I see "Strömgren" when I get the data back out, etc.

My database tables are all UTF8 and my html pages and forms are all charset=utf-8.

I recently noticed that in one portion of my application my unicode characters appeared to be double-encoded. When I displayed what should be Strömgren, I saw StrÃ¶mgren -- Str\xc3\xb6mgren vs Str\xc3\x83\xc2\xb6mgren. If I utf8_decode the bad string, it looks correct again.

I am assuming that this is "double-encoding."

I discovered that the portion of the application that was displaying the double-encoded data was using different code to make its database connection, and that code was making this call:

$db->set_charset("utf8")

I had intended to do that for ALL of my database connections, but somehow ended up only doing it in one place. So, almost all of my application is using connections without the set_charset command, and Strömgren always looks right, and the lone piece of code which does have set_charset("utf8") (and which only ever reads from the db, never writes to it) is displaying it incorrectly.
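To illustrate (a rough sketch with placeholder host, credentials, and names), that lone piece of code does roughly this:

    <?php
    // Hypothetical connection code -- host, credentials, and names are placeholders.
    $db = new mysqli('localhost', 'app_user', 'secret', 'app_db');
    if ($db->connect_errno) {
        die('Connect failed: ' . $db->connect_error);
    }

    // The call that only this one piece of code was making; without it the
    // connection falls back to the client/server default character set.
    if (!$db->set_charset('utf8')) {
        die('Unable to set connection character set: ' . $db->error);
    }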

I am not certain what to make of this, but my suspicion is that the data in my database is not really stored in UTF8 encoding? Maybe when I send it Strömgren (without having set_charset("utf8")), it thinks it is receiving latin1 (or whatever), and when I read that back out I am getting latin1, but since my html pages have "charset=utf-8" it is being "mis-displayed" as Strömgren when really the database thinks it is sending me StrÃ¶mgren. (I am probably not saying that either correctly OR clearly, but I hope it can be understood.)

I have two questions:

First, does any of my thinking here make sense, or am I completely off base?

Second, what is the best way for me to determine whether the data in my database is mis-encoded (i.e. does the database actually contain Strömgren or StrÃ¶mgren)?

Rick Free
  • Reading this page, I am becoming more convinced that I have filled my db with bad data: http://artur.ejsmont.org/blog/content/utf8-double-encoding-issues-in-web-apps-and-mysql – Rick Free Jun 12 '15 at 22:50

3 Answers


One way to see what is actually stored is to use the HEX function. (That's the closest MySQL gets to the Oracle-style DUMP() function.)

Here's a demonstration that shows the use of the HEX function to return what's stored...

  CREATE TABLE foo 
  ( foo_lat VARCHAR(10) CHARSET latin1
  , foo_utf VARCHAR(10) CHARSET utf8
  );

  INSERT INTO foo (foo_lat, foo_utf) VALUES
  ( UNHEX('6dc3b1c3b6'), UNHEX('6dc3b1c3b6') );

  SELECT foo_lat
       , foo_utf
       , HEX(foo_lat)
       , HEX(foo_utf)
    FROM foo ;

foo_lat    foo_utf  HEX(foo_lat)  HEX(foo_utf)  
---------  -------  ------------  --------------
mñö      mñö      6DC3B1C3B6    6DC3B1C3B6   

Your thinking seems pretty clear.

The set_charset function is the recommended method for specifying the client characterset when using the mysqli interface.

I'm kind of curious what the characterset is before you run that.

  $db->character_set_name();
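Something along these lines would show it (just a sketch; it assumes $db is an existing mysqli connection, as in your question):

    <?php
    // Sketch: $db is an existing mysqli connection.
    echo "before set_charset: " . $db->character_set_name() . "\n";
    $db->set_charset("utf8");
    echo "after  set_charset: " . $db->character_set_name() . "\n";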

I'm also curious... from that same connection, what the following query returns.

 SELECT @@session.character_set_client
      , @@session.character_set_connection
      , @@session.character_set_results
      , @@session.character_set_server
      , @@global.character_set_client
      , @@global.character_set_connection
      , @@global.character_set_results
      , @@global.character_set_system

... from both a copy of the example code that is displaying characters "correctly" and a copy of the example code that is displaying characters "incorrectly", before and after you do the set_charset.
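For example, a sketch of running that from PHP (assuming $db is the same mysqli connection the page normally uses; only the session values are shown, the global ones work the same way):

    <?php
    // Sketch: $db is the same mysqli connection the page normally uses.
    $sql = "SELECT @@session.character_set_client     AS sess_client
                 , @@session.character_set_connection AS sess_connection
                 , @@session.character_set_results    AS sess_results
                 , @@session.character_set_server     AS sess_server";

    $result = $db->query($sql);
    if ($result) {
        print_r($result->fetch_assoc());
    }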

If you see latin1 anywhere, that could be a problem.

If there are UTF-8 encoded values stored in latin1 columns, that's a problem. When you go to pull those values out of the database with a utf8 characterset, then the values will get "double encoded".

So, verify that your characterset on the column is utf8.

WARNING: If you do have UTF-8 values stored in latin1 columns, DO NOT TRY TO FIX the problem by converting the columns to utf8; that will make the problem worse by double encoding the stored values.

If you want to play around with that, do it on a separate test database; it might be a good time to test whether restoring your mysqldump backups to another test MySQL instance on another test machine actually works. (If the .sql files produced by mysqldump are garbled, you want to find that out now, rather than later, when you actually need to do a restore.)


NOTE: It's the characterset on the column definition that's important. The setting on the table is just a default value used when it's not specified on the column. And the setting at the database level is just a default that's used when a table is created without a characterset specified.

That is, changing the characterset of the database doesn't affect the existing tables and columns. It will have an effect on any CREATE TABLE that doesn't specify a characterset.

A SHOW CREATE TABLE foo is a convenient way to see the actual characterset of the table and columns.
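For instance, against the demonstration table above, the output would look something like this (illustrative and abridged; details vary with MySQL version and server defaults):

    SHOW CREATE TABLE foo;

    -- Illustrative output (abridged); the per-column CHARACTER SET is what matters:
    -- CREATE TABLE `foo` (
    --   `foo_lat` varchar(10) CHARACTER SET latin1 DEFAULT NULL,
    --   `foo_utf` varchar(10) CHARACTER SET utf8 DEFAULT NULL
    -- ) ENGINE=InnoDB DEFAULT CHARSET=...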

spencer7593
  • Thanks much. I confirmed the table is UTF8 and I used SELECT HEX(last_name) and got solid confirmation that the data in the table is NOT what I thought it was. My use of an improperly encoded connection for both writing and reading has been masking this issue for months. Next up? Finding a way to fix the data (once I fix the app). This page is proving informative: http://artur.ejsmont.org/blog/content/utf8-double-encoding-issues-in-web-apps-and-mysql – Rick Free Jun 13 '15 at 00:34

Each table has a default character set and a collation used to store its values. Find them by using:

SHOW FULL COLUMNS FROM table_name;

and

SHOW CREATE TABLE table_name;

Then you can alter a table to have UTF-8 like this:

ALTER TABLE tbl_name
CONVERT TO CHARACTER SET utf8;

To fix the encoding of a table defined as latin1 and filled in with UTF-8 data:

ALTER TABLE table_name CHANGE field field blob;
ALTER TABLE table_name CHANGE field field text charset utf8;
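To sanity-check that kind of repair, you can compare the displayed value and the HEX() of a few rows before and after the two statements (a sketch, reusing the placeholder names above):

    -- Run before and after the two ALTER TABLE statements: the HEX() bytes
    -- should be unchanged; only the column's character set label changes,
    -- so the displayed value over a utf8 connection becomes correct.
    SELECT field, HEX(field)
      FROM table_name
     LIMIT 10;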
  • I know my tables are UTF-8. What I first need to do is to find out how to confirm my suspicion that I have filled my table with double-encoded data by connecting to it via a non-utf8 connection. I am thinking of something along the lines of, is there a way I can connect to the database other than via my PHP script, select a row, see whether I get Strömgren or StrÃ¶mgren, and know for certain that either my data IS mucked up, or I need to look for another explanation. – Rick Free Jun 12 '15 at 22:48
  • Good story, but it doesn't answer the question. OP probably knows how to change the character set, but after you set it to UTF-8, you can still double, triple, or gazipple encode the result. – GolezTrol Jun 12 '15 at 22:48
  • I have messed with encodings in the past, and it's... a mess. You should treat the results using the original encodings or you will have trouble. – Jose Luis Bernat Jun 12 '15 at 22:49
  • You can assume that MySQL is not doing any conversion of charsets if the connection's charset and the table charset are equal. So the representation of the string lies entirely on the HTML side. When you started using $db->set_charset("utf8"), MySQL began translating the utf-8 string stored in a latin1 field into a double-encoded utf-8 string. – Jose Luis Bernat Jun 12 '15 at 22:57

StrÃ¶mgren instead of Strömgren implies Mojibake.

If SELECT HEX(...) FROM ... gives you 53 74 72 C3B6 6D 67 72 65 6E (without the spaces), you have correctly stored the utf8 encoding. C3B6 is the utf8 hex for ö.

"Double encoding" would show 53 74 72 C383 C2B6 6D 67 72 65 6E, where C383 and C2B6 are utf8 hex for Ã and ¶.

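To get a rough idea of how many rows are affected, one heuristic (a sketch; the table and column names are placeholders borrowed from the comments above) is to look for that telltale C383 byte sequence:

    -- Count rows whose stored bytes contain C3 83 (utf8 "Ã"), a strong hint
    -- of double-encoded data. Placeholder names; expect occasional false
    -- positives since the pattern can straddle byte boundaries.
    SELECT COUNT(*)
      FROM people
     WHERE HEX(last_name) LIKE '%C383%';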
See duplicate for discussion and solution, including how to recover the data via a pair of ALTER TABLEs.
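For illustration only (placeholder table, column, and length; rehearse it on a copy of the table first), one common form of that repair on a utf8 column holding double-encoded data is:

    -- Pass the column through latin1 and binary so the bytes are
    -- reinterpreted exactly once. Placeholder names and lengths.
    ALTER TABLE people MODIFY last_name VARCHAR(50) CHARACTER SET latin1;
    ALTER TABLE people MODIFY last_name VARBINARY(50);
    ALTER TABLE people MODIFY last_name VARCHAR(50) CHARACTER SET utf8;

    -- Equivalent row-by-row form:
    -- UPDATE people
    --    SET last_name = CONVERT(BINARY(CONVERT(last_name USING latin1)) USING utf8);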

That is, both Jose and Spencer had elements of the complete answer.

Rick James
  • I think you have mis-read. SELECT HEX confirmed the database contains double-encoded UTF8. The string contained in the database is exactly what you correctly said it would be if it was double-encoded. No mojibake (but thanks for teaching me a new word!). – Rick Free Jun 26 '15 at 00:30