
Here is the problem, in a working unit test. I think it's either a bug in DBI + DBD::mysql, with respect to how it handles MySQL JSON columns, or a bug in my brain.

use strict;
use warnings;

use utf8;

use Test2::V0;
use Test2::Plugin::UTF8;
use Test2::Plugin::NoWarnings echo => 1;

use DBI;
use DBD::mysql 4.041;  # 4.041+ required for utf8mb4
use JSON 4.01 qw//;
use Encode;

#
# setup $dbh, create test table
#
my $dbname = '';
my $host = 'localhost';
my $user = '';
my $pass = '';

my $dbh = DBI->connect(
  "DBI:mysql:" . ($dbname || '') . ";host=" . $host,
  $user, 
  $pass || undef, 
  { RaiseError => 1, PrintError => 0, AutoCommit=> 1 }
);

$dbh->{'mysql_enable_utf8mb4'} = 1;  # have DBD::mysql return decoded character data

$dbh->do(
  "CREATE TABLE IF NOT EXISTS `test` ("
  . "id int unsigned, "
  . "`my_json` json NOT NULL, "
  . "`my_text` mediumtext NOT NULL "
  . ") ENGINE=InnoDB "
  . "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
);

#
# create and insert test data
#

# A use case for spam! Got this junk from spam inbox
my $utf8str = "ion?• été e😍💋🔥ulière vs• Ch• 😊⭐👉🏻🔞🍆Sho是ab 期待您x";
my $hash = { test => $utf8str };
my $json = JSON->new->encode( $hash );

my $id = time;
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);

#
# retrieve test data and check it
#
my ( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");

is( $my_text, $json );  # ok

is( $my_json, $json );  # fails. got {"test": "ion?\nâ\N{U+80}¢ ét .... 

is( decode('UTF-8', $my_json), $json );  # ok'ish. mysql adds a space between "test":"..." but value looks ok

#
# another test, independent of JSON encoder, using hand-built json
#
$id++;
$json = '{"test":"' . $utf8str . '"}';
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);

( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");

is( $my_text, $json );  # ok

is( $my_json, $json );  # fails. got {"test": "ion?\nâ\N{U+80}¢ ét .... 

printf "%vX", $my_json;  # 7B.22.74.65.73.74.22.3A.20.22.69.6F.6E.3F.E2.80.A2.20.C3.A9.74.C3.A9.20.65.F0.9F.98.8D.F0.9F.92.8B.F0.9F.94.A5.75.6C.69.C3.A8.72.65.20.76.73.E2.80.A2.20.43.68.E2.80.A2.20.F0.9F.98.8A.E2.AD.90.F0.9F.91.89.F0.9F.8F.BB.F0.9F.94.9E.F0.9F.8D.86.53.68.6F.E6.98.AF.61.62.20.E6.9C.9F.E5.BE.85.E6.82.A8.78.22.7D

printf "%vX", $json;  # 7B.22.74.65.73.74.22.3A.22.69.6F.6E.3F.2022.20.E9.74.E9.20.65.1F60D.1F48B.1F525.75.6C.69.E8.72.65.20.76.73.2022.20.43.68.2022.20.1F60A.2B50.1F449.1F3FB.1F51E.1F346.53.68.6F.662F.61.62.20.671F.5F85.60A8.78.22.7D

is( decode('UTF-8', $my_json), $json );  # ok'ish. mysql adds a space between "test":"..." but value looks ok

#
# cleanup
#
$dbh->do("DROP TABLE `test`");

$dbh->disconnect();

done_testing();

My understanding is that the JSON standard (RFC 8259) requires UTF-8 encoding for JSON text exchanged between systems. In addition, MySQL requires/uses UTF-8 for JSON columns, as described here:

MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation. Strings in other character sets are converted to utf8mb4 as necessary. (https://dev.mysql.com/doc/refman/8.0/en/json.html)
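And MySQL itself reports utf8mb4 for a value extracted from the JSON column. A minimal sketch, reusing the test table and $dbh from above (assuming MySQL 5.7.9+ for the -> JSON path operator):

my ( $charset ) = $dbh->selectrow_array(
  q{SELECT CHARSET(JSON_UNQUOTE(my_json->'$.test')) FROM test WHERE id=?},
  undef, $id
);
print "$charset\n";  # expected: utf8mb4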

My understanding is also that DBD::mysql handles the UTF-8 decoding when mysql_enable_utf8mb4 is set, and should be returning decoded character strings as it does for the mediumtext column, as stated here:

This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.

When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string. (https://metacpan.org/pod/DBD::mysql#mysql_enable_utf8)

However, that does not appear to happen for JSON columns: explicit decoding appears to be required after retrieving data from a JSON column.

So which is it... bug in DBI/DBD::mysql, or bug in my brain?

EDIT: Good news, it's not my brain. Bad news, it appears to be a known bug: https://github.com/perl5-dbi/DBD-mysql/issues/309

So, the answer I'm seeking now is a backward-compatible workaround, i.e., a workaround that won't break if/when DBD::mysql is fixed. Double-decoding would not be good.
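To make the double-decoding concern concrete, here is a minimal standalone sketch of what blindly calling decode() would do once DBD::mysql starts returning decoded character strings:

use Encode qw(decode);

my $already_decoded = "\x{E9}t\x{E9}";            # "été", already character data
my $mangled = decode('UTF-8', $already_decoded);  # 0xE9 is malformed as UTF-8, so
                                                  # each é becomes U+FFFD

my $wide = "\x{1F60D}";  # a decoded emoji (a wide character)
# decode('UTF-8', $wide) croaks: "Cannot decode string with wide characters"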

yahermann
  • `JSON->new->encode` expects strings of Code Points, but you provided a string encoded using UTF-8 instead. You need `use utf8;` to tell Perl your source code is encoded using UTF-8 so that it can decode it properly. There may be other problems. – ikegami Jan 15 '22 at 19:47
  • `decode('UTF-8', $my_json)` makes no sense. `$dbh->{'mysql_enable_utf8mb4'} = 1;` should have decoded it already. – ikegami Jan 15 '22 at 19:48
  • It would help if you print the various variables using `printf "%vX"`. – ikegami Jan 15 '22 at 19:50
  • $dbh->{'mysql_enable_utf8mb4'} = 1 doesn't appear to work for JSON columns. After posting, I found this: https://github.com/perl5-dbi/DBD-mysql/issues/309, so now I'm wondering how to fix this issue in a way that is backward-compatible if/when DBD::mysql gets fixed. – yahermann Jan 15 '22 at 19:56
  • added `use utf8;` and `printf "%vX"` as suggested, thx! – yahermann Jan 15 '22 at 20:04

3 Answers


So, the answer I'm seeking now is a backward-compatible workaround, i.e., a workaround that won't break if/when DBD::mysql is fixed. Double-decoding would not be good.

You could try to determine if the JSON decode bug is present by creating a test table where you insert a non-ASCII character whose UTF-8 encoding is known to be longer than one byte. For example:

$dbh->do("DROP TABLE IF EXISTS json_decode_test");
$dbh->do("CREATE TABLE json_decode_test (id int unsigned, `my_json` json NOT NULL)");
my $unicode_str = "是";  # This character will always have a UTF-8 encoding with
                         # byte length > 1
my $hash = { test_str => $unicode_str };
my $json = JSON->new;
my $json_str = $json->encode( $hash );
my $id = time;
my $attrs = undef;
$dbh->do("INSERT INTO json_decode_test SET id=?, my_json=?", $attrs, $id, $json_str);
my ( $json_str2 ) = $dbh->selectrow_array(
    "SELECT my_json FROM json_decode_test WHERE id=$id");
my $hash2 = $json->decode( $json_str2 );
my $unicode_str2 = $hash2->{test_str};
# If the json unicode bug is present, $unicode_str2 will not be decoded. Instead
#  it will be a string of length 3 representing the UTF-8 encoding of $unicode_str
#  (printf "%vX\n", $unicode_str2) gives output : E6.98.AF
my $json_unicode_bug = (length $unicode_str2) > 1;
if ( $json_unicode_bug ) {
    print "unicode bug is present..\n";
    # need to run decode_utf8() on every JSON column value returned by DBI
}
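The probe can then gate an on-the-fly workaround. For example, a small helper (the name is mine, not part of DBI or DBD::mysql) that decodes only while the bug is present, so nothing gets double-decoded once DBD::mysql is fixed:

use Encode qw(decode);

# decode a value fetched from a JSON column only if the probe above
# detected the bug; otherwise pass it through untouched
sub fetch_json_col {
    my ( $raw ) = @_;
    return $json_unicode_bug ? decode('UTF-8', $raw) : $raw;
}

# usage:
# my $hash2 = $json->decode( fetch_json_col($json_str2) );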
Håkon Hægland
  • Something like this will work as a unit test, thx!! I was hoping for an on-the-fly workaround. I'm thinking about using `utf8::is_utf8` to detect whether the string has already been decoded. – yahermann Jan 17 '22 at 21:40
  • You are welcome. I do not think `utf8::is_utf8()` will work here, see [this](https://stackoverflow.com/a/14584079/2173773) answer – Håkon Hægland Jan 17 '22 at 21:43
  • Yep, I agree `utf8::is_utf8` should not be used in general. The question I was considering was whether it would work reliably for THIS specific case of retrieving an undecoded JSON column from MySQL using the present buggy version of DBD::mysql. I can't convince myself that it will, because there's no way to reliably know how `utf8::is_utf8()` would act in a fixed DBD::mysql, i.e. a DBD::mysql that doesn't yet exist! – yahermann Jan 17 '22 at 21:53
  • The complete and best available solution appears to be 1. when retrieving, `decode( 'UTF-8', $my_json )` prior to json-decoding, and 2. add a unit test (as you suggested) to confirm the bug is acting as predicted, in case DBD::mysql is fixed at a later date. Does that seem right? – yahermann Jan 17 '22 at 21:59
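A sketch of that combination, assuming the $json_unicode_bug probe from the answer above; the canary test fails as soon as DBD::mysql starts decoding JSON columns itself, flagging the workaround for removal:

# in the test suite (Test2::V0 supplies ok(), as in the question's test):
ok( $json_unicode_bug,
    'DBD::mysql still returns undecoded JSON columns; drop the decode() workaround when this fails' );

# at retrieval time (decode() from Encode, as in the question):
my $fixed = decode( 'UTF-8', $my_json );   # required while the bug is present
my $data  = JSON->new->decode( $fixed );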

Change the SQL statement that creates the test table to the following:

my $query = "
CREATE TABLE IF NOT EXISTS `test` (
    `id`       INT UNSIGNED,
    `my_json`  JSON NOT NULL,
    `my_text`  MEDIUMTEXT NOT NULL
) ENGINE=InnoDB
  DEFAULT CHARSET=utf8mb4
  COLLATE=utf8mb4_unicode_ci
";

$dbh->do($query);
Polar Bear
  • Doesn't work. Still getting undecoded `my_json` – yahermann Jan 17 '22 at 21:28
  • To be clear, I added `ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci` to the test and re-ran, got same result. For consistency I added the above to the original question. Thx for suggesting – yahermann Jan 17 '22 at 21:37

From the two `%vX` dumps in the question:

F0.9F.98.8D -- UTF-8 encoding -- this is good
1F60D -- Unicode codepoint -- not useful in `MEDIUMTEXT`

Do

SELECT HEX(my_json) FROM test WHERE id = ...

to see what is inside MySQL; it should be F09F988D.
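The same check can be run from DBI (a sketch reusing the question's $dbh and table):

my ( $hex ) = $dbh->selectrow_array(
    "SELECT HEX(my_json) FROM test WHERE id=?", undef, $id );
print "$hex\n";   # bytes as stored by MySQL; should contain F09F988D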

You should not be encoding or decoding any string that is written to or read from MySQL.

Rick James