2

More than three years ago I posted a similar question related to not being able to insert data into a table because of what I believe is an incorrect character set, and I'm having the same problem with the same script with different data today.

The issue is with German characters from Ilpo J�rvinen ("a" with umlaut?).

execute failed: Incorrect string value: '\xD6sterl...' with mariadb and perl DBD

DBD::mysql::st execute failed: Incorrect string value: '\xE4rvine...' for column `lsv6`.`xyz_content`.`fulltext` at row 1 at myscript.pl line 467.

I'm trying to process emails with different charsets, but apparently this one is not being encoded by my script properly. From the email:

Content-type: text/plain; charset="iso-8859-1"
Content-transfer-encoding: quoted-printable

The full script is too long to post here, but I believe these are the relevant parts where the database is involved.

sub db_connect($) {
    ...
    return DBI->connect("DBI:mysql:database=$DB{'db'};host=$DB{'host'}", $DB{'user'}, $DB{'pass'}, { PrintError => 1, RaiseError => 1, mysql_enable_utf8mb4 => 1 } )
    ...
}

$fullText               = $dbh->quote($fullText);
my $sql                 = <<EOF;
INSERT INTO xuxgc_content (title, alias, introtext, `fulltext`, state, catid, created, created_by, created_by_alias, modified, modified_by, checked_out, checked_out_time, publish_up, publish_down, images, urls, attribs, version, ordering, metakey, metadesc, metadata, access, hits, language)
VALUES ($title, "$title_alias", $introText, $fullText, $state, $catid, $created, $created_by, $created_by_alias, $modified, $modified_by, $checked_out, $checked_out_time, $publish_up, $publish_down, $images, $urls, $attribs, $version, $ordering, $metakey, $metadesc, $metadata, $access, $hits, $language);
EOF

my $sth = $dbh->prepare($sql);
$sth->execute();
db_disconnect($dbh);

If I strip out the non-ASCII characters, it works properly:

$fullText               =~ s/\xE4//g;                                                                                                                                                                              
$fullText               =~ s/\xFC//g;                                                                                                                                                                              
$fullText               =~ s/\xE5//g;                    

Should I be using iso-8859-1 encoding instead of utf8mb4?

Alex Regan
  • 477
  • 5
  • 16
  • Does it work if you add `$fullText = decode( "iso-8859-1", $fullText );` – ikegami Oct 22 '22 at 01:21
  • Yes, that worked, thanks. How much of the other decoding/quoting do I need to do? I changed the encoding type to utf8mb4 in my DBI->connect - do I need to know what the charset of the email that is being used? – Alex Regan Oct 23 '22 at 02:09

1 Answers1

2

$fullText is a string of text encoded using iso-8859-1. It needs to be decode into a string of text before working with it.

And DBD::mysql expects strings to be encoded appropriately for the connection. In your case, this is UTF-8.[1] So you need to encode the query into a string of bytes before passing it to DBD::mysql.

# Convert bytes to text
my $fullText = decode( "iso-8859-1", $fullText_latin1 );

# Convert text to SQL string literal
my $fullText_lit = $dbh->quote( $fullText );

my $sql = "... $fullText_lit ...";

# Convert text to bytes
my $sql_utf8 = encode( "UTF-8", $sql );

my $sth = $dbh->prepare( $sql_utf8 );

Note that DBD::mysql suffers from The Unicode Bug.

Perl has two internal storage formats for strings: "downgraded" and "upgraded". It shouldn't matter which internal storage format is used for a string. But DBD::mysql (and every other DBD I know) access the internal string buffer without checking which of the two storage format is used. Code that does this is said to suffer from The Unicode Bug.

So what I said earlier isn't quite right. DBD::mysql expects the internal buffer of the strings to be encoded appropriately for the connection. encode always produces a string in the downgraded format, and the internal buffer of such strings always contain the string exactly.

This means that as long as you encode the strings passed to DBD::mysql using encode and the proper encoding, you won't have problems.


  1. One of the effects of passing mysql_enable_utf8mb4 => 1 as a connection option is to set the encoding connection to UTF-8 as if you had used SET NAMES utf8mb4.
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • I think there is something I'm still not understanding. Can I ask you to lay this out more clearly for me? This particular email is encoded using ISO-8859-1 to support the binary characters, but others are encoded using UTF-8. It appears you are saying to encode the text, but the example you gave above was to decode it. Do I need to distinguish between the two to decide whether to encode/decode? It also appears Joomla requires UTF-8, which is why I'm using mysql_enable_utf8mb4, correct? – Alex Regan Oct 23 '22 at 22:08
  • 1
    Re "*This particular email is encoded using ISO-8859-1 to support the binary characters,*", This makes no sense. iso-8859-1 is a way to encode text. It says that `a` is 0x61, `b` is 0x62, etc – ikegami Oct 23 '22 at 22:48
  • 1
    Re "*It also appears Joomla requires UTF-8, which is why I'm using mysql_enable_utf8mb4*", This does two things: 1) Sets the text encoding used for communication with the database to UTF-8. This is good, because it means you can exchange any Unicode character with the db. 2) It causes strings coming from the database to be automatically decoded. /// BUT. It does not cause strings passed to the database to be automatically encoded, thus the need to `encode` showed in the answer. /// It's a poor API because it dates back to before Perl's Unicode support. – ikegami Oct 23 '22 at 22:51
  • 1
    Re "*It appears you are saying to encode the text, but the example you gave above was to decode it.*", You always want to work with decoded text. So you need to decode inputs that are encoded. I said you needed to decode the body of the email, and I did that. (The library that parsed the message should have done that for you, though.) So you need to encode outputs that won't be encoded for you. I said you needed to encode strings passed to the database, and I did that. (Similarly, DBD:;mysql should do that for you, but it doesn't for the historical reason mentioned above.) – ikegami Oct 23 '22 at 22:55