7

Problem found

The problem seems to be that $mysqli->set_charset() is not accepting `utf8mb4' as a valid encoding (just as I "speculated" in the first update). MySQL version is 5.5.41 and PHP version is 5.4.41 (no problem with that).


Sorry for the title, I've been searching/reading about what/where can the problem be and I'm already too confused about this...

I recently started using utf8mb4 in mysql. I'm using utf8mb4 as charset and utf8mb4_unicode_ci as collation for all tables/columns.

So I first I changed:

$mysqli->set_charset('utf8');

to

$mysqli->set_charset('utf8mb4');

made sure my php files are utf8 (I'm using Visual Studio Code so the files are created in UTF-8 by default), and php/html headers are set to utf8:

index.php

header('Content-type: Text/HTML; Charset=UTF-8');

main.php (included at the end of index.php)

<meta http-equiv="Content-Type" content="Text/HTML" />
<meta charset="UTF-8" />

The problem is that for some tables I have to manually insert the data, and this data is stored as is: with special characters, with accents, ñ, etc... And when I display this data in my website I can see that these characters have replaced the special/accented characters.

So my question is: is there any way to store data as is (without replacing/converting special/accented characters) in mysql and be able to display it fine (as is)?

If I revert to $mysqli->set_charset('utf8'); the data is displayed fine... So this keeps me wondering that there should be no problem with storing utf-8 characters as they are and there is some codification problem somewhere...

I'm using sqlyog community (with wine) and I read somewhere that sometimes the gui does not work correctly when you change some db/table configuration and the only way is the old way (running yourself the query), but I didn't tried this yet. I ran queries to set the charset and collation of all tables/columns.

What do you think?

UPDATE

I'm starting to think that mysqli does not accept utf8mb4 as a valid character encoding and uses utf8 from php and not from mysql... I also think mysql fckd up creating utf8mb4 instead of updating the existing utf8 to support 4 bytes....

As I'm testing with mysqli charset utf8, everything is stored as is and displayed as is (with mysql charset and collation set to utf8mb4...).

UPDATE 2

SELECT name, HEX(name) FROM person LIMIT 1

This is what it outputs:

New Person has name Altaïr 416C7461C3AF72

But as I already said, this is using:

$mysqli->set_charset('utf8');

to insert and to select. If I use utf8mb4 instead this is what it gets stored:

Altaïr

But it's displayed ok. What it's not displayed ok is if the name is stored as is, the displayed name will be Alta�r.

So the question is: Why is mysqli/mysql storing ï as ï using utf8mb4? And why is php displaying special characters like ï as when utf8mb4 is set in mysqli?

Can someone please confirm that mysqli::set_charset accepts utf8mb4 as a valid encoding?

UPDATE 3

I have a class function that selects a string from a table "es", for example: Iniciar Sesión (this is what's stored) and if mysqli charset is utf8, what is being selected/displayed is Iniciar Sesión.

This maybe a completely different problem but it's clearly another codification problem. From my understanding, if the tables/columns are utf8mb4 and mysqli is set to utf8, mysql has to encode from utf8 (3bytes) to ut8mb4 (full byte support). So this means that mysqli does not use utf8 from php but from mysql. Is this correct, right?

My application is currently having a rough time with encodings... (but maybe is some server configuration problem...)

UPDATE 4

Can the problem be here? I really have no idea about this kind of configurations:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8               |
| character_set_connection | utf8               |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8               |
| character_set_server     | latin1             |
| character_set_system     | utf8               |
| collation_connection     | utf8_general_ci    |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | latin1_swedish_ci  |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

UPDATE 4-1/2 (copied from a comment)

CREATE TABLE `es` (
    id int(11) NOT NULL AUTO_INCREMENT, 
    name varchar(30) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 
    text varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 
    PRIMARY KEY (id), 
    UNIQUE KEY name (name)
) ENGINE=InnoDB AUTO_INCREMENT=76 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci` 
Chazy Chaz
  • 1,781
  • 3
  • 29
  • 48
  • No conversion? You mean BLOB? The rules of thumbs of dealing with UTF8 is : ALWAYS document the conversion + encoding +decoding method. – mootmoot Apr 27 '16 at 16:00
  • What's BLOB? I thought php handled that for me... If php works with utf8 and takes utf8mb4 from mysql, what kind of conversion is needed? – Chazy Chaz Apr 27 '16 at 16:04
  • BLOB is a MySQL data type that is typically used for storing large amounts of text or binary data. See here http://dev.mysql.com/doc/refman/5.7/en/blob.html. – Vadim Apr 27 '16 at 16:16
  • Ah right. No, I don't use BLOB, I just use the basic data types as text and var/char. – Chazy Chaz Apr 27 '16 at 16:17
  • Another thing to check is that you allocate enough space for special characters in your column definition. For example, `varchar(255)` is not big enough to contain 255 special characters. – Vadim Apr 27 '16 at 16:28
  • And YOUR QUESTION say : is there any way to store data as is (without replacing/converting special/accented characters) in mysql? Or you want to change your question to something more appropriate : what is the best practice to store UTF-8/encoded string to mysql using php – mootmoot Apr 27 '16 at 16:28
  • Obviously I prefer the best practice, but I'm asking this because if I have to insert data manually and it contains special characters they are not going to be displayed correctly... So... I just got more confused :P – Chazy Chaz Apr 27 '16 at 16:34
  • @Vadim Fortunately that was before 4.1, now it counts the length as characters and not as bytes http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html – Chazy Chaz Apr 27 '16 at 16:44
  • Please provide `SELECT col, HEX(col) FROM ...` so we can see what the text was stored as. That will tell us whether the text was stored incorrectly or was fetched incorrectly. `�` and `ï` have different causes. – Rick James Apr 27 '16 at 17:53
  • `ñ` and `ï`, are hex `F1` and `EF` in `CHARACTER SET latin1`. They are `C3B1` and `C3AF` in utf8 or utf8mb4 (no difference). – Rick James Apr 27 '16 at 17:56
  • @RickJames Do I just echo the column? – Chazy Chaz Apr 27 '16 at 18:20
  • @RickJames setting the mysqli charset to utf8 i'm getting this: `416C7461C3AF722049626E2D4C612741686164`. And utf8mb4 this: `416C7461C383C2AF722049626E2D4C612741686164`. – Chazy Chaz Apr 27 '16 at 19:42
  • Do the `SELECT` that I suggested after filling in a table name, a `WHERE` clause, and changing `col` to the relevant column name. – Rick James Apr 29 '16 at 00:23
  • @RickJames What do you want me to put in the `WHERE` clause? I updated my question. – Chazy Chaz Apr 29 '16 at 02:52
  • Good... `Altaïr`, when encoded in utf8 (or utf8mb4), is `41 6C 74 61 C3AF 72`. That says that the data is correctly utf8 in the table. What versions of PHP and MySQL are you running? – Rick James Apr 29 '16 at 04:14
  • PHP is 5.4.41 and MySQL is 5.5.41 under Centos 5. – Chazy Chaz Apr 29 '16 at 05:22
  • @RickJames I just noticed that a class function from my framework is printing characters ok only when mysqli charset is utf8mb4 (and the data from the table is utf8). It's a language system, it selects one table or another depending on the browser/cookie/configuration locale... This is getting weirder... – Chazy Chaz Apr 29 '16 at 08:57
  • No, when using mysqli charset utf8mb4 `Altaïr` is `41 6C 74 61 C383C2AF 72`. – Chazy Chaz Apr 29 '16 at 09:16
  • utf8/utf8mb4 hex for `Sesión` is `53 65 73 69 C3B3 6E`. `Sesión` / hex `53 65 73 69 C383 C2B3 6E` implies "double-encoding". – Rick James Apr 30 '16 at 17:31
  • MySQL 5.5.3 is when utf8mb4 was introduced, so you are safe there. PHP probably simply passes the string on without caring. Give this a try after executing `charset`: `printf("Current character set: %s\n", mysqli_character_set_name($link));` – Rick James Apr 30 '16 at 17:39
  • Please provide `SHOW CREATE TABLE es`. – Rick James Apr 30 '16 at 17:41
  • `$mysqli = new mysqli(); $mysqli->set_charset('utf8mb4'); printf("Current character set: %s\n", mysqli_character_set_name($mysqli));` And this is what it prints: `Current character set: latin1`. If I change it to utf8 it says utf8... So I was right lol (first update), `mysqli` is not accepting utf8mb4... – Chazy Chaz Apr 30 '16 at 17:56
  • `SHOW CREATE TABLE es` -> `CREATE TABLE `es` ( `id` int(11) NOT NULL AUTO_INCREMENT, `name` varchar(30) COLLATE utf8mb4_unicode_ci DEFAULT NULL, `text` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL, PRIMARY KEY (`id`), UNIQUE KEY `name` (`name`) ) ENGINE=InnoDB AUTO_INCREMENT=76 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci` – Chazy Chaz Apr 30 '16 at 18:31

3 Answers3

2

The problem might stem from the fact that you're not using utf8mb4 in your MySQL column definition (at least you did not say what encoding you're using).

Here is an example of a MySQL table definition with a column that uses utfmb4:

CREATE TABLE `person` (
  `name` varchar(255) CHARACTER SET utf8mb4
)

UPDATE

Using the following table definition:

CREATE TABLE `person` (
  `name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

and the following PHP script:

<?php
$mysqli = new mysqli('localhost', 'username', 'password', 'database');
$mysqli->set_charset('utf8mb4');

$mysqli->query("INSERT INTO `person` VALUES ('Altaïr Ibn-La\'Ahad')");

$result = $mysqli->query("SELECT * FROM `person` LIMIT 1");

$person = $result->fetch_object();

if($person)
    printf ("New Person has name %s.\n", $person->name);

$result->close();
$mysqli->close();

when I insert "Altaïr Ibn-La'Ahad" into the database, the name is stored as is without changes. The script also prints the name without changes: "New Person has name Altaïr Ibn-La'Ahad."

I hope this helps you resolve your problem. Let me know if it does or doesn't.

Vadim
  • 1,916
  • 2
  • 19
  • 39
  • I'm using utf8mb4 as charset and utf8mb4_unicode_ci as collation for all tables/columns. – Chazy Chaz Apr 27 '16 at 16:18
  • When you insert data, does it look correct in the database: 1) when inserted directly into the database; 2) when inserted with PHP? – Vadim Apr 27 '16 at 16:24
  • When I insert data directly it looks as is. But when I insert it with php it depends how I treat the user input in php. If I use a filter it converts the especial characters but they are displayed as they sould. For example: I insert the name "Altaïr Ibn-La'Ahad" and this is what is stored "Altaïr Ibn-La'Ahad". If I don't filter it is just the ï what is converted to ï. – Chazy Chaz Apr 27 '16 at 16:37
  • I've updated my original answer with an example. It produces the output you want while storing the data as is. – Vadim Apr 27 '16 at 17:03
  • I've created the table with your query and executed the script. But the name is stored as I last said, the `ï` is stored as `ï`. There has to be a problem with my mysql server... I've marked your answer as the correct, I'll have to look further into the server configuration. – Chazy Chaz Apr 27 '16 at 17:18
  • It is worth mentioning that I used a Mac to write this script. It could be that Windows handles things differently. Also, when I remove `$mysqli->set_charset('utf8mb4');` from my script, `ï` is also replaced with `ï` for me. Maybe the `set_charset` statement is ignored on your end? You also may want to read this page https://mathiasbynens.be/notes/mysql-utf8mb4 on how to support full unicode in MySQL. Finally, you can always try using PDO instead of MySQLi (although that is not guaranteed to solve your problem). Post your solution when you've solved your problem so it can benefit others. :) – Vadim Apr 27 '16 at 17:39
  • Sure, I always post the solutions. I'm using Manjaro (Arch Linux) and the server is centos 5. I prefer using mysqli as it doesn't include the values in the query, they are sent separately or something like that. – Chazy Chaz Apr 27 '16 at 18:02
  • Did you do some special mysql configuration? I'm using mysql 5.5.41. – Chazy Chaz Apr 27 '16 at 19:22
  • If I run your script with `$mysqli->set_charset('utf8');` the name will be stored as is and it will be displayed fine. – Chazy Chaz Apr 27 '16 at 19:43
  • Yes, I don't know why but it's not accepting utf8mb4... Also you know how can I check if mysql is translating from utf8 (php) to utf8mb4? or the encoding that I set in `$mysqli->set_charset()` (utf8 or utf8mb4) is the same? Mysql had to do complicated things... – Chazy Chaz Apr 28 '16 at 19:28
  • Sorry, that wasn't working either. I unaccept your answer because @RickJames and I found the real problem. I'll wait for accept the answer that fixes the problem. – Chazy Chaz Apr 30 '16 at 20:59
2

utf8mb4 Altaïr is 41 6C 74 61 C383C2AF 72

Ouch. That is "double encoding". latin1 EF was converted to utf8/utf8mb4 C3AF; then C3, incorrectly treated as latin1 was converted to C383 and AF to C2AF.

Here's what probably happened:

  • The client had characters encoded as utf8 (good); and
  • SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
  • The column in the table declared CHARACTER SET utf8 (or utf8mb4) (good).

That second step should have been fixed by

$mysqli->set_charset('utf8mb4');

I assume you are not mixing mysql_* and mysqli_* interfaces. Use only the latter.

How about posting a short, reproducible, test case.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Yes I would love to be able to reproduce it but It's probably some problem in my framework/application itselft and I don't know where to start... However, this class function that I have is selecting and displaying everything ok when mysqli charset is utf8mb4, but the rest of my application is showing the �. When the mysqli charset is just utf8 this function won't show the � but the typical html encoding problem, I'm going to update my question to add this. – Chazy Chaz Apr 30 '16 at 16:47
  • Displaying is tricky. Browsers are 'forgiving'; they will try different ways to interpret the bytes and, in some cases, make garbled text look correct. The black diamond occurs when it can't succeed. – Rick James Apr 30 '16 at 17:08
  • If you have stored lots of data "double-encoded", here is info on repairing the data: http://mysql.rjweb.org/doc.php/charcoll#fixing_double_encoding_ But you must also fix the code not to continue storing such. – Rick James Apr 30 '16 at 17:09
  • No, the application is still in development, the only important data are the cities, etc lists but they are inserted correctly (at least I see them fine). If only I could locate the code causing this... Did you see the last update to my question? – Chazy Chaz Apr 30 '16 at 17:14
  • Your mention of `C383C2AF` prompted me to talk about "double-encoding". The difference between `utf8` and `utf8mb4` remains a mystery. "Update 4" looks 'normal'. Executing `SET NAMES utf8mb4` will change 4 of those for the duration of the connection. – Rick James Apr 30 '16 at 17:28
  • Some of what you have mentioned sounds like you `INSERTed` with one setting of `SET NAMES` and `SELECTed` with a different setting. – Rick James Apr 30 '16 at 17:33
  • That's not possible, I only call `$mysqli->set_charset('utf8mb4');` once, in the class db right after I initialize `mysqli`. – Chazy Chaz Apr 30 '16 at 17:52
  • We are both missing something. Work on writing a minimal test case the shows the problem. – Rick James Apr 30 '16 at 18:53
  • `$mysqli->set_charset('utf8mb4');` it's not working... You know why is not accepting utf8mb4? – Chazy Chaz Apr 30 '16 at 19:38
  • "not working" meaning it gives an error message? Or it returns the wrong value from `printf("Current character set: %s\n", mysqli_character_set_name($link));`? Or something else? – Rick James Apr 30 '16 at 19:43
  • Sorry, "not working" meaning that it returns `FALSE` and the default charset `latin1` is used instead. `mysqli_character_set_name($mysqli)` returns `latin1` (I assume is the default charset). – Chazy Chaz Apr 30 '16 at 20:30
  • Let's try another approach. Instead of using `charset()`, execute the command `SET NAMES utf8mb4` through the mysqli interface. Since that is just a string, PHP can't be rejecting it. – Rick James Apr 30 '16 at 23:11
  • Oh, you got FALSE. But what error message? Check mysqli's error message routine. – Rick James Apr 30 '16 at 23:11
  • Run this command: `SHOW CHARACTER SET WHERE Charset LIKE 'utf8%';` You should get 2 rows. If not, then that is a clue. – Rick James Apr 30 '16 at 23:13
  • Right sorry, I forgot it's the same as checking queries errors... I'm trying to get rid of this problem for 2 days now and I couldn't sleep well so I'm a bit slow... this is the error, a problem in my server: `Can't initialize character set utf8mb4 (path: /usr/share/mysql/charsets/)`. So this is a mysql problem? – Chazy Chaz May 01 '16 at 00:41
  • And yes, I get 2 rows from running that query. – Chazy Chaz May 01 '16 at 00:46
  • Hmmm... What's in that path? I would guess one file per charset. But maybe utf8mb4 is missing? Of that path is not readable? Do `ls -ld`, etc to check. Installation problem? utf8mb4 came with 5.5.3, so you should have it. Was there an upgrade from 5.1 recently? Got any other clues? – Rick James May 01 '16 at 00:59
  • I'm asking my friend to check it, he didn't gave me ssh access. I'll tell you what was the problem and how my friend solved it and also update the question so anyone with this problem knows how to fix it. – Chazy Chaz May 01 '16 at 02:36
0

Mistery solved! There was a bad installation/upgrade/config with mysql and utf8mb4 was not properly installed.

The problem with the function was that it was re-encoding the db values with utf8_encode() and somehow it was causing these kind of characters ó -> ó.

Chazy Chaz
  • 1,781
  • 3
  • 29
  • 48
  • 1
    Can you please let us know in detail how you solved this issue. I am having the same issue and pulling out my hair at the moment :( – Waris Ali Jul 15 '16 at 12:02
  • Sorry for the delay, I was enjoying my holidays :) MySQL (in my case MariaDB) was lacking the neccesary files so the encoding didn't exist. The files need to be compiled (I think recompiling with necessary flags) or reinstall a recent version. This happened on an old cent os 5 server, so in more recent versions this shouldn't happen, in fact I installed cent os 6.7 and utf8mb4 was detected without problems. – Chazy Chaz Aug 14 '16 at 12:31