5

My question is almost exactly the same as the one found here

MySQL - 1300 - Invalid utf8 character string on update

No solution was presented and the help from the person (creating a temporary table) didn't seem to help. Here is the select statement I am using:

SELECT
    CONVERT(line_1 USING utf8mb4),
    CONVERT(line_1 USING latin1),
    HEX(line_1)
FROM address
WHERE ((CAST(CONVERT(line_1 USING latin1) AS CHAR)) <> (CAST(line_1 AS CHAR)))
    AND CONVERT(line_1 USING utf8mb4) IS NULL;
+-------------------------------+------------------------------+----------------------------------------------------+
| CONVERT(line_1 USING utf8mb4) | CONVERT(line_1 USING latin1) | hex(line_1)                                        |
+-------------------------------+------------------------------+----------------------------------------------------+
| NULL                          | Högbergsgatan 97             | 48F6676265726773676174616E203937                   |
| NULL                          | Zücherstrasse 161            | 5AFC636865727374726173736520313631                 |
| NULL                          | 2275, Rue de l'Université    | 323237352C20527565206465206C27556E69766572736974E9 |
| NULL                          | Högbergsgatan 97             | 48F6676265726773676174616E203937                   |
+-------------------------------+------------------------------+----------------------------------------------------+

When I try to run the following update command I get:

UPDATE address
SET line_1 = CONVERT(CAST(CONVERT(line_1 USING latin1) AS CHAR) USING utf8mb4)
WHERE (CAST(CONVERT(line_1 USING latin1) AS CHAR) <> CAST(line_1 AS CHAR))
    AND CONVERT(line_1 USING utf8mb4) IS NULL;
ERROR 1300 (HY000): Invalid utf8mb4 character string: 'F66762'

I tried setting the line in the following ways, all producing the same error:

SET line_1 = CAST(CONVERT(line_1 USING latin1) AS CHAR)
SET line_1 = CONVERT(line_1 USING latin1)

I also looked at http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/ to see if maybe it was a double encoding issue, but none of these worked and I kept getting the same character string error.

Additionally, I looked at https://mathiasbynens.be/notes/mysql-utf8mb4 in order to help with the conversion steps, but utf8mb4 and utf8 are causing the exact same issues. (At first I thought it was a utf8 thing so I switched to utf8mb4 and when I still kept getting the same issues, I knew there was a deeper problem)

As you can see, something weird is going on. Looking at my show create address table, I can verify that the charset is set properly:

SHOW CREATE TABLE address;
| address | CREATE TABLE `address` (
  `addressid` bigint(20) NOT NULL AUTO_INCREMENT,
  `addressuuid` char(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `line_1` blob,
  PRIMARY KEY (`addressid`)
) ENGINE=InnoDB AUTO_INCREMENT=48970 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='All potential addresses' |

Additionally, you can see my character variables are correct in my instance:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

How did I get here

So it might be beneficial to give some background information on the issue just in case it is something in the background that caused the issue.

I had a database that originally was set to latin1 encoding everything. I then ran the following code:

SET NAMES 'latin1';

/* We must change things to blob and then back again */
ALTER TABLE `address` CHANGE line_1 line_1 BLOB;
ALTER TABLE `address` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE `address` CHANGE line_1 line_1 VARCHAR(64);

The reason for switching to blob and then back to varchar is the normal recommended procedure. (www.percona.com/blog/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/)

Let me know if that helps, and if there is any more information that can be provided. I'm using MySQL 5.6 so theoretically it should handle things better, but who knows. As there are only 4 lines I could just manually update each one, but theoretically there is a greater underlying issue and since I do actually have quite a few more columns to go through, it would be nice to make sure I have a functional way to handle these cases in case I get something with quite a few lines.

Community
  • 1
  • 1
Aram Papazian
  • 2,453
  • 6
  • 38
  • 45
  • Start by doing a `SELECT HEX(col), col ...` so we can see what is in the field. What is the goal? Changing the table? Or just getting out something like `Högbergsgatan`? – Rick James Feb 24 '15 at 00:06
  • I've added hex to the table above. The goal is to be able to convert everything from the old character encoding to the new one. But each method I try seems to have problems with strings not being able to convert/invalid strings. – Aram Papazian Feb 24 '15 at 07:57
  • Did you ever get this right? I have the same issue. – Adergaard Apr 25 '15 at 19:06
  • So unfortunately nothing seemed to work properly as some things were double encoded, others single, and it became a massive mess. Luckily our database was not to crazy large so I ended up having to export our DB, I then manually updated all wrong characters, fixed the encoding on the tables, and then reimported. Haven't had trouble since, but it wasn't fun =/ – Aram Papazian May 08 '15 at 14:23

1 Answers1

3

Since line_1 is a blob, not a text field, MySQL has no control over the "characters" in it, and does not care if it is non-text information (such as a JPG). In the examples you gave, you have latin1 text in the field (eg, hex F6 for ö). Hence, CONVERT(line_1 USING latin1) worked 'fine'.

I don't understand your goal. Are you trying to read the BLOB as TEXT? If so, and if all the non-ascii characters are encoded latin1, then that CONVERT is the answer.

If your goal is something else, then let's approach it from there.

It was not "doubly encoded", so none of them would work.

ALTER TABLE address CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Do a SHOW CREATE TABLE address and check the CHARACTER SET of line_1.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • So originally the field was set as VARCHAR(64). The goal is to convert the DB from latin1 to utf8mb4. So I set the names to latin1, altered the field to blob, converted the character set and then tried to convert back to varchar(64). When I tried to convert back I got the following error: Incorrect string value: '\xF6gberg...' for column 'line_1' at row 7578 – Aram Papazian Feb 24 '15 at 18:55
  • 1
    The 2-step-alter-thru-blob is used for when the bytes don't agree with the declaration. The 1-step-ALTER-CONVERT is probably what you needed. (Or it seems that is what you need now.) So, I think you need 3 steps from the current BLOB with latin1 bytes: SET latin1, ALTER to VARCHAR, ALTER CONVERT. – Rick James Feb 24 '15 at 21:17
  • Comparision of the "fixes": http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases – Rick James Jun 12 '20 at 17:09