7

This is probably something simple. I swear I've been looking online for the answer and haven't found it. Since my particular case is a little atypical I finally decided to ask here.

I have a few tables in MySQL that I'm using for a Chinese language program. It needs to be able to support every possible Chinese character, including rare ones that don't have great font support. A sample cell in the table might be this:

東菄鶇䍶倲涷蝀凍鯟崠埬䰤

In order to get that to work right in the database, I've had to set the encoding/collation to utf8mb4. So far so good. Unfortunately when I pull the same string into PHP, it gets printed as this:

東菄鶇䍶?倲??涷蝀凍鯟?崠埬?䰤

How can I finally kill off the remaining question marks and get them to show as the unicode glyphs they should be? I've got the php page itself using UTF8 encoding in the tag and as a meta tag.

Why can't they communicate with each other? What am I doing wrong?

Yhilan
  • 269
  • 1
  • 3
  • 15
  • Are you setting the database connection encoding? If so, how and to what? Are the characters actually converted somewhere or is it just a display problem? What does `bin2hex()` on that string give you in PHP? – deceze Oct 23 '12 at 10:40

2 Answers2

16

I'd simply guess that you are setting the table to utf8mb4, but your connection encoding is set to utf8. You have to set it to utf8mb4 as well, otherwise MySQL will convert the stored utf8mb4 data to utf8, the latter of which cannot encode "high" Unicode characters. (Yes, that's a MySQL idiosyncrasy.)

On a raw MySQL connection, it will have to look like this:

SET NAMES 'utf8mb4';
SELECT * FROM `my_table`;

You'll have to adapt that to the best way of the client, depending on how you connect to MySQL from PHP (mysql, mysqli or PDO).


To really clarify (yes, using the mysql_ extension for simplicity, don't do that at home):

mysql_connect(...);
mysql_select_db(...);
mysql_set_charset('utf8mb4');     // adapt to your mysql connector of choice

$r = mysql_query('SELECT * FROM `my_table`');

var_dump(mysql_fetch_assoc($r));  // data will be UTF8 encoded
deceze
  • 510,633
  • 85
  • 743
  • 889
  • I'm accessing from PHP, and currently using `mb_internal_encoding('UTF-8');` Does that affect the solution? Is there such a thing as `mb_internal_encoding('UTF-8MB4');`? – Yhilan Nov 08 '12 at 12:36
  • No, `mb_internal_encoding` only sets the internal encoding for the `mb_` functions, it has nothing to do with MySQL. You'll get normal UTF-8 back from MySQL, `utf8mb4` is only an internal thing to MySQL. – deceze Nov 08 '12 at 13:25
  • So how would I use SET NAMES from within the PHP? Or am I going about this wrong and should somehow set it from within something like phpmyadmin? – Yhilan Nov 09 '12 at 14:48
  • See update. Hope it's clear now. Setting the connection encoding to `utf8mb4` makes PHP receive UTF-8 encoded data. – deceze Nov 09 '12 at 14:59
4

Just to add to @deceze's answer, I recommend a well-configured MySQL server (for me, in /etc/mysql/mysql.conf.d/mysqld.cnf). Here are the configuration options to make sure you're using utfmb4, although I do recommend going through every MySQL configuration option though, daunting as it is, there are a lot of defaults that are are very non-optimal.

[client]

default-character-set           = utf8mb4

[mysql]

default_character_set           = utf8mb4

[mysqld]

init-connect                    = "SET NAMES utf8mb4"
character-set-client-handshake  = FALSE
character-set-server            = "utf8mb4"
collation-server                = "utf8mb4_unicode_ci"
autocommit                      = 1
block_encryption_mode           = "aes-256-cbc"

That last one is just one that should be default. Also, init-connect deals with not having to execute that every time. Keeps code clean. Now run:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

You should return something like the following:

+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+

And looks like you're doing this already, but doesn't hurt to explicitly define on table creation:

CREATE TABLE `mysql_table` (
  `mysql_column` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`mysql_column`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4;

Hope this helps someone.

Eugene
  • 1,539
  • 12
  • 20
  • don't forget to restart mysql with `service mysqld start` or with `/etc/init.d/mysqld restart` after changing the config file – salvob Nov 28 '16 at 10:03
  • 1
    Why are there settings with the same name but with hyphens or underscores, and quotes or no quotes? – Aaron Surrain Mar 15 '17 at 22:00