
I have a rails app that receives data from an Android device. I noticed that some of the data, when in Japanese, is not saved correctly. It shows up as literal question marks (not the diamond ones) in the MySQL client and in the rails website.

It turns out that the database that I have connected to the rails app is set to Latin1. Rails is set to UTF-8.

I have read a lot about character encoding problems, but every description says that the mangled data is still somewhat readable. Mine, however, consists only of literal question marks. Trying to convert the data to UTF-8 with several methods found on the web doesn't change a thing either. I suspect that the data is converted to question marks when it's written to the database.

Sample output from the MySQL console:

select * from foo where bar = "foobar";
+-------+------+------------------------+---------------------+---------------------+
| id    | name | bar                    | created_at          | updated_at          |
+-------+------+------------------------+---------------------+---------------------+
| 24300 | ???? | foobar                 | 2012-01-23 05:04:22 | 2012-01-23 05:04:22 |
+-------+------+------------------------+---------------------+---------------------+
1 row in set (0.00 sec)

The input data that my Rails app received from the Android client was:

name = 爆笑笑話

This input data has been verified to exist in the rails app before saving to the database. So it's not mangled in the Android client or during transfer to the server. Is there any chance I can get this data back? Or is it completely lost?

dda
Peterdk
  • 1
    What do you get if you `SELECT HEX(name) FROM foo`? If it's "3F3F3F...", your data is toast; if it's something else, there might be a chance of saving it. –  Dec 22 '12 at 01:20
  • It's indeed 3f3f3f. I'll check it against some other data that was inserted recently, which can't have been affected by an old backup I restored. – Peterdk Dec 22 '12 at 02:03
  • If that's the case, the database literally just contains question marks -- the original data has been irretrievably lost. eggyal's advice on what you can do going forward looks good. –  Dec 22 '12 at 02:05
  • Yes, it is also in the records that are written today. Very strange, you would expect rails 3 to handle it somehow correctly, instead of losing the data this way. – Peterdk Dec 22 '12 at 02:06
  • I had the same issue and I wrote my solution with details [here](http://dba.stackexchange.com/a/96322/14447) – Fabio Mar 26 '15 at 12:45

1 Answer


It is easy to believe that data is encoded one way when it is actually encoded another way. Any attempt to retrieve the data directly converts it first to the character set of your database connection and then to the character set of your output medium, so what you see is not necessarily what is stored. You should therefore first verify the actual encoding of the stored data with either `SELECT BINARY name FROM foo WHERE bar = 'foobar'` or `SELECT HEX(name) FROM foo WHERE bar = 'foobar'`.
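As a rough illustration of what to look for in the `HEX()` output, the following Ruby sketch prints the bytes of the sample name from the question, once as intact UTF-8 and once after a lossy conversion to latin1. The helper names `hex_of` and `lossy_latin1` are made up for this example, and the sketch assumes the script itself is saved as UTF-8. Note that MySQL prints upper-case hex digits, while this prints lower-case ones.

```ruby
# Hex dump of a string's raw bytes, comparable to MySQL's HEX() output.
def hex_of(str)
  str.unpack("H*").first
end

# Hypothetical helper: simulates a lossy conversion to latin1, where every
# character that latin1 cannot represent is replaced by a literal "?".
def lossy_latin1(str)
  str.encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
end

name = "爆笑笑話"

puts hex_of(name)                # e78886e7ac91e7ac91e8a9b1 (UTF-8 bytes intact)
puts hex_of(lossy_latin1(name))  # 3f3f3f3f (four literal question marks)
```

If the column shows the first pattern, the original bytes are still there; if it shows only `3f` bytes, nothing of the original text remains.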

Where the character 爆 is expected, you will likely find one of the following byte sequences:

  • 0xe78886, indicating that your column actually contains UTF-8 encoded data: this usually happens when the character set of the database connection over which the text was originally inserted was set to latin1 but actually UTF-8 encoded data was sent.

    You must be seeing ? characters when fetching the data because something between the data storage and the display has been unable to transcode those bytes. However, given that MySQL thinks they represent çˆ† and those characters are likely available in most character sets, it's unlikely that this is occurring within MySQL itself, unless you're explicitly adjusting the encoding information during retrieval.

    Anyway, if this is the case, you need to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:

    Warning 

    The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

    ALTER TABLE t1 CHANGE c1 c1 BLOB;
    ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
    

    The reason this works is that there is no conversion when you convert to or from BLOB columns.

  • 0x3f, indicating that the database does actually contain the literal character ? and your original data has been lost. This doesn't happen easily, since in strict SQL mode MySQL raises error 1366 if implicit transcoding results in loss of data; without strict mode it only issues a warning and stores ? in place of each character it cannot convert. Perhaps strict mode was off, or there was some explicit transcoding in your insert statement?

    In this case, you need to convert the storage encoding to a suitable format, then update or re-insert the data:

    ALTER TABLE foo CONVERT TO CHARACTER SET utf8;
    UPDATE foo SET name = _utf8 '爆笑笑話' WHERE bar = 'foobar';
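
The two cases above can also be told apart programmatically. The following Ruby sketch takes a hex string as returned by `SELECT HEX(name)` and tries to reinterpret the bytes as UTF-8. The function name `recover_from_hex` is an invention for this example, and the check is deliberately simple: it only assumes that a value consisting solely of `3f` bytes is lost and that anything forming valid UTF-8 can be read back as such.

```ruby
# Hypothetical helper: given the output of MySQL's HEX(column), return the
# text if the stored bytes are valid UTF-8, or nil if nothing can be recovered.
def recover_from_hex(hex)
  bytes = [hex.downcase].pack("H*")  # hex digits back to a binary string

  # Only literal question marks (0x3f) are stored: the original is gone.
  return nil if bytes.bytes.all? { |b| b == 0x3f }

  # Relabel the bytes as UTF-8 without converting them, then validate.
  text = bytes.force_encoding("UTF-8")
  text.valid_encoding? ? text : nil
end

p recover_from_hex("e78886e7ac91e7ac91e8a9b1")  # the original four characters
p recover_from_hex("3f3f3f3f")                  # => nil
```

Relabeling with `force_encoding` plays the same role as the round trip through BLOB: the bytes are left untouched and only the declared encoding changes. Plain ASCII values also pass this check, since ASCII is valid UTF-8.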
    
eggyal
  • Unfortunately the table contains 3f3f3f. So somehow rails doesn't work very nicely with my database, and the data has been lost. It's not very critical, but still too bad. – Peterdk Dec 22 '12 at 02:08
  • Also the data is immediately lost when I just update the field with the _utf8 command. Is this normal? (not having converted anything yet) – Peterdk Dec 22 '12 at 02:15
  • What is the character set of your database connection: `SHOW VARIABLES LIKE 'character_set%';`? What is the character set of the `name` column: `SELECT CHARSET(name) FROM foo;`? – eggyal Dec 22 '12 at 02:16
  • 1
    `character_set_connection = utf8`, `character_set_database = latin1`, character set of the column is latin1 – Peterdk Dec 22 '12 at 02:23
  • Now when I have converted the table to utf8 it immediately recognizes the same input query as correct utf8 and stores it that way. Still weird. – Peterdk Dec 22 '12 at 02:23
  • It's odd because I get an error when column is latin1 and I attempt to insert characters that cannot be transcoded. What version of MySQL? – eggyal Dec 22 '12 at 02:24
  • `mysql Ver 14.14 Distrib 5.1.66, for debian-linux-gnu (i486) using readline 6.1` on ubuntu 10.04 lts – Peterdk Dec 22 '12 at 02:25
  • And the server is also 5.1.66? – eggyal Dec 22 '12 at 02:43
  • I assume, it's the same package. (running commandline also on server, so yes) – Peterdk Dec 22 '12 at 02:49
  • Intriguing. Will investigate more tomorrow, but off to bed shortly! – eggyal Dec 22 '12 at 02:52