MySQL recently reported the following error: [HY000][1366] Incorrect string value: '\xF0\x9D\x98\xBD\xF0\x9D...' for column 'name'

After investigating, I found that the offending value comes from a filename that apparently contains bold characters: 4 𝘽𝘼𝙉𝘿𝙀 𝘼𝙉𝙉𝙊𝙉𝘾𝙀 - TV.mp4

Instead of changing the encoding of my database to accept such characters, I'd rather sanitize the value in PHP before inserting it. But I have no idea which operation I should run to end up with the following sanitized value: 4 BANDE ANNONCE - TV.mp4

Any help would be appreciated.

VaN
  • Does this answer your question? [Regular expression for valid filename](https://stackoverflow.com/questions/11794144/regular-expression-for-valid-filename) – dognose Jun 01 '23 at 12:54
  • Not really. The linked question is about C#, and the answers provided don't seem to apply here. – VaN Jun 01 '23 at 13:04
  • There's a couple of [solutions](https://stackoverflow.com/a/73729825/231316) out [there](https://stackoverflow.com/a/63068771/231316) that have some replacements via lookup arrays that might be helpful. You could "fix" these using the translations, and then perform a normal replacement for invalid characters – Chris Haas Jun 01 '23 at 13:31
  • These are not the normal Latin letters somehow "made bold"; they are separate characters in their own right. That `𝘽`, for example, is MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL B (https://www.havirho.eu/Programming/U-1D400-tm-U-1D7FF.htm). So you will have to implement logic to "translate" them into their corresponding regular letters. – CBroe Jun 01 '23 at 13:46
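
A minimal sketch of the kind of translation described in the last comment, assuming the mbstring extension is available (for `mb_chr`) and that only the sans-serif bold italic letters need handling. The helper name `fold_sans_bold_italic` is hypothetical; other styled alphabets in the same Unicode block would need their own ranges.

```php
<?php
// Hypothetical helper: maps MATHEMATICAL SANS-SERIF BOLD ITALIC letters
// (capitals U+1D63C..U+1D655, small letters U+1D656..U+1D66F) to plain A-Z / a-z.
function fold_sans_bold_italic(string $input): string
{
    $map = [];
    for ($i = 0; $i < 26; $i++) {
        $map[mb_chr(0x1D63C + $i, 'UTF-8')] = chr(ord('A') + $i);
        $map[mb_chr(0x1D656 + $i, 'UTF-8')] = chr(ord('a') + $i);
    }
    // strtr() leaves every character that is not a key in $map untouched.
    return strtr($input, $map);
}

// The styled letters are written as \u{...} escapes so the source stays ASCII.
$filename = "4 \u{1D63D}\u{1D63C}\u{1D649}\u{1D63F}\u{1D640} "
          . "\u{1D63C}\u{1D649}\u{1D649}\u{1D64A}\u{1D649}\u{1D63E}\u{1D640} - TV.mp4";
echo fold_sans_bold_italic($filename), PHP_EOL; // 4 BANDE ANNONCE - TV.mp4
```

Because the lookup only contains the styled letters, ordinary input such as `Clean File Name.mp4` passes through unchanged.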

1 Answer

You can use the PHP iconv function to convert the string from one character encoding to another. In this case, you can try converting the string from UTF-8 to ASCII//TRANSLIT, which will attempt to transliterate any non-ASCII characters into their closest ASCII equivalents.

Here's an example:

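function sanitize_string($input_string) {
    // //TRANSLIT asks iconv to approximate characters that cannot be represented in ASCII.
    // Note that iconv() returns false if the conversion fails.
    return iconv("UTF-8", "ASCII//TRANSLIT", $input_string);
}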

$filename = "4 𝘽𝘼𝙉𝘿𝙀 𝘼𝙉𝙉𝙊𝙉𝘾𝙀 - TV.mp4";
$sanitized_filename = sanitize_string($filename);
echo $sanitized_filename;

Depending on your system's iconv implementation, this should output 4 BANDE ANNONCE - TV.mp4, which is the sanitized value you're looking for.
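
If iconv's transliteration turns out to behave differently from one system to another, Unicode normalization may be worth considering as an alternative. The sketch below is a different technique from the iconv call above. It assumes the intl extension is loaded (it provides the `Normalizer` class), and the helper name `sanitize_with_nfkc` is hypothetical. NFKC normalization replaces compatibility characters such as U+1D63D with their plain equivalents and leaves ASCII text as it is. The final `preg_replace` assumes the goal is to drop any remaining 4-byte UTF-8 characters, which is a guess based on the error message.

```php
<?php
// Hypothetical helper; requires the intl extension for the Normalizer class.
function sanitize_with_nfkc(string $input): string
{
    // NFKC maps compatibility characters (e.g. U+1D63D) to their plain forms.
    $normalized = Normalizer::normalize($input, Normalizer::FORM_KC);
    if ($normalized === false) {
        // Normalization failed (for example on invalid UTF-8); return the input as-is.
        return $input;
    }
    // Remove whatever is still outside the Basic Multilingual Plane (4-byte UTF-8).
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $normalized) ?? $normalized;
}

// The styled letters are written as \u{...} escapes so the source stays ASCII.
$styled = "4 \u{1D63D}\u{1D63C}\u{1D649}\u{1D63F}\u{1D640} "
        . "\u{1D63C}\u{1D649}\u{1D649}\u{1D64A}\u{1D649}\u{1D63E}\u{1D640} - TV.mp4";
echo sanitize_with_nfkc($styled), PHP_EOL;                // 4 BANDE ANNONCE - TV.mp4
echo sanitize_with_nfkc("Clean File Name.mp4"), PHP_EOL;  // Clean File Name.mp4
```

Keep in mind that NFKC also rewrites other compatibility characters (ligatures and superscript digits, for instance), so it changes more than just the styled letters.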

Paul Lake
  • Yes, this is the most accurate fix I found as well. While writing unit tests I ran into a problem, though: `iconv("UTF-8", "ASCII//TRANSLIT", "Clean File Name.mp4")` lowercases the capital letters and returns `clean file name.mp4`. Is there a way to leave the "clean" characters untouched? – VaN Jun 01 '23 at 14:05
  • @VaN, that isn't the case when I try it, but that might be dependent on your system's underlying iconv implementation: https://3v4l.org/hs41h – Chris Haas Jun 01 '23 at 14:12