6

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters with four bytes, such as "", that are only supported by MySQL's utf8mb4 character set.

For example,

C = -2.4‰ ± 0.3‰; H = -57‰

should become

C = -2.4‰ ± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

ThisSuitIsBlackNot
  • 23,492
  • 9
  • 63
  • 110
Matthias Munz
  • 3,583
  • 4
  • 30
  • 47
  • I hear what you are asking, but why not convert the column to `CHARACTER SET utf8mb4`? – Rick James Jan 11 '17 at 02:02
  • utf8mb4 is preferred, if you don't want to damage data. utf8 is 2 bytes and just truncates the UTF-8 byte sequence, if I remember correctly. BTW utf8mb4 is default in newer versions of MySQL. No need to configure it in `my.cnf`. MariaDB is different. – Helmut Wollmersdorfer Mar 07 '22 at 01:02
  • Check the column definitions of your DB-schema. If there still are some on utf8 (or Latin1?) convert them. Perl handles everything perfect if you `db_connect` with `mysql_enable_utf8 => 1`. – Helmut Wollmersdorfer Mar 07 '22 at 01:09

1 Answers1

11

MySQL's utf8mb4 encoding is what the world calls UTF-8.

MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).

Reference

So, the following will match the unsupported characters in question:

/[^\N{U+0000}-\N{U+FFFF}]/

Here are three different techniques you can use clean your input:

1: Remove unsupported characters:

s/[^\N{U+0000}-\N{U+FFFF}]//g;

2: Replace unsupported characters with U+FFFD:

s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;

3: Replace unsupported characters using a translation map:

my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For example,

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "C = -2.4‰ ± 0.3‰; H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

C = -2.4‰ ± 0.3‰; H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • If I understand right, I have to convert the utf8 characters (e.g. the 4 byte char "") into BMP before applying your regex. Is there a function to do this? – Matthias Munz Jan 10 '17 at 16:37
  • 1
    I don't know what you mean by converting "" into BMP. The BMP is a set of characters that doesn't include "". You could create a mapping of characters outside of the BMP to characters inside of the BMP. I've just added a demonstration of this to my answer. – ikegami Jan 10 '17 at 16:47
  • thanks for your comment, but I meant something else: I would like to remove the 4 byte utf8 chars by using your regex s/[^\N{U+0000}-\N{U+FFFF}]//g; However when I apply it to my file (which contains a "" instead of its BMP representation) the 4 byte chars are not filtered out. – Matthias Munz Jan 10 '17 at 16:53
  • Are you sure you have decoded text (as opposed to UTF-8)? – ikegami Jan 10 '17 at 16:54
  • No, I think it's not decoded. How can I do this? For example my file contains this phrase "C = -2.4‰ ± 0.3‰; H = -57‰" – Matthias Munz Jan 10 '17 at 16:55
  • For that to be possible, your source file would have to be encoded using UTF-8 and contain `use utf8;`. Is that the case? – ikegami Jan 10 '17 at 16:56
  • I am little confused right now. Maybe you can give a explanation what I have to do with the file, after reading the text in to the memory? – Matthias Munz Jan 10 '17 at 17:00
  • You said you had `"C = -2.4‰ ± 0.3‰; H = -57‰"`. Now you say you are reading it from a file. You'll need to decode the file. You can specify an `:encoding` layer when you open the file, or you could use the `open` pragma as I did in the example I've added to my answer. – ikegami Jan 10 '17 at 17:02
  • Now got it, with replacement is even better. Perfect. Thanks a lot! – Matthias Munz Jan 10 '17 at 17:25