3

I have a site that accepts form-based inputs from users and stores into a mysql(i) database. 99% of users are from the USA.

If a user is on a mobile device and decides it would be fun to use an emoji in one of the form fields, the resulting SQL causes a failure when the UPDATE or INSERT is attempted. This is due to the emoji being outside utf-8 and therefore a mis-match of encoding between the UX and the DB. I believe I understand the underlying issue.

I'm on mysql 5.5 so I could actually support utf8mb4 char set -- but it'll be some time before I can get around to that.

In the meantime, would it be ok/safe to just do this (in PHP):

$sql = "INSERT INTO my_table ... *long insert statement across two dozen rows*";
$sql = utf8_decode($sql);
$db->sql_query($sql);

So - simply force the entire SQL statement to utf-8? I realize emoji characters and some others may be converted to "?" -- but as long as the statement itself isn't broken or I don't lose normal text inputs from the users, I'll be ok with this.

Any issues with this idea as a temp fix until I can convert the table's schemas to work with utf8mb4?

EDIT: Error being returned from the database looks like this:

"Incorrect string value: '\xF0\x9F\x99\x8A...' for column..."
C C
  • 419
  • 1
  • 4
  • 18
  • 1
    UTF-8 covers all Unicode, also the emoji. However emoji are there since version 6.0 of Unicode. The weak point seems more to be the fonts not having those characters. Maybe replace `[\u1F300-\u1F5F]`. – Joop Eggen Jan 05 '15 at 00:17
  • Thanks, but that will get complicated very quickly. These emoji characters on mobile devices are not restricted to a single range of characters. I'd have to maintain some sort of map, I think. – C C Jan 05 '15 at 01:04
  • @Joop If there's an actual error being thrown ("failure"...?), then it's hardly a font issue. – deceze Jan 05 '15 at 01:20

1 Answers1

3

Some facts:

  • many emoji are above the BMP (basic multilingual plane) in Unicode, i.e. above code point U+FFFF
  • MySQL's utf8 charset can only represent the BMP, it is not actually full UTF-8 (yay MySQL)
  • for full Unicode support MySQL has utf8mb4
  • there should be no actual error when trying to store characters above the BMP in a utf8 charset column, MySQL will simply discard the unsupported characters (yay silent data corruption)
  • utf8_decode does not "force everything to UTF-8", rather it converts a string from UTF-8 encoding to ISO-8859-1 ("Latin-1") encoding; this will discard many more characters than just emoji

So it looks to me like you have some other underlying problem if you get an actual error thrown at some point. Maybe you think you're talking to your database in UTF-8 when you're actually not. You need to set your connection charset via mysqli_set_charset.

"Filtering out" characters your database does not support should already happen, MySQL will simply discard those characters. If you want to do this manually in PHP, you could do this to filter out all characters above U+FFFF:

$string = preg_replace_callback('/./u', function (array $m) {
    return strlen($m[0]) > 4 ? null : $m[0];
}, $string);

Overall: supporting utf8mb4 takes about two minutes. You just need to set your table/column charset to utf8mb4 and do the same with your mysqli connection. If you're then sending actual UTF-8 data from your PHP app to your database, you'll be storing emoji just fine.

deceze
  • 510,633
  • 85
  • 743
  • 889
  • found this in the database abstraction layer code: @mysqli_query($this->db_connect_id, "SET NAMES 'utf8'"); so, it looks like the connection is begin set to 'utf8', right? I looked at the schema for the tables and it says that the table collation is "utf8_bin". Does this info give any more clue as to why I am getting that string error? – C C Jan 05 '15 at 01:55
  • I think we are back to mysql being limited to 3-byte characters even when communication and schema are utf-8 -- and that emoji character I was testing is a 4-byte character. Do you agree that is what is causing the error to be thrown? – C C Jan 05 '15 at 02:03
  • Using a `SET NAMES` query instead of `mysqli_set_charset` may be an issue. This is essentially setting the connection encoding behind mysqli's back; mysqli has no idea that you're trying to communicate with the database via `utf8` and may screw up the encoding client-side when escaping data. – deceze Jan 05 '15 at 02:05
  • ok thanks...then I'm officially SOL on this approach -- this system is built on top of PHPBB; I am using their database abstraction code and don't want to start changing it. Sigh. – C C Jan 05 '15 at 02:19
  • Let's put it this way: I'm not terrifically surprised that phpBB has sh*tty db code... ;o) – deceze Jan 05 '15 at 02:25