0

I recently modified the DDL of a VIEW with Workbench by adding a filter with strings of Chinese characters. The DDL can be saved successfully and worked perfectly right after being modified. However, the modification itself in the DDL became garbled symbols several hours later. The character_set_database and the character set of some columns were originally utf8(utf8_general_ci). After getting the error, I have made all of them utf8mb4 with default collation (utf8mb4_0900_ai_ci). The garbled symbols did change after altering the character set but they are still garbled. Any idea?

Example:
DDL right after modification: where m.NAME not in ('王曉明','張小英')
DDL several hours after modification: where m.NAME not in ('???D?','??\?')

Environment:
MYSQL 8.0.13 Community Server - GPL
Windows 10 professional 64bit (Tradition Chinese; Character set of typing output: UNICODE)
Workbench 8.0.13

Show variables like '%char%'
Result:
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8mb4
character_set_server utf8mb4
character_set_system utf8
character_sets_dir C:\Program Files\MySQL\MySQL Server 8.0\share\charsets\

Henry Tsai
  • 21
  • 2
  • After testing, what happened during "the several hours" was that a batch file executes automatically. It drops a table used in the view and create the table again. After the process, the VIEW DDL went wrong. SQL CODES drop table if exists sincelastmonth; create table sincelastmonth as select * from ...; – Henry Tsai Apr 27 '20 at 04:15
  • Why not just recreate the view? – kmoser Apr 27 '20 at 06:50
  • Recreate the view only fix the problem temporarily. In the view, I used a table called sincelastmonth which got updated 3 times a day through a windows batch file. In the batch file, I drop the existing sincelastmonth and create a new sincelastmonth again from a much bigger table.(DROP-AND-CREATE IS THE PROCESS CAUSED THE PROBLEM) This way, I can speed up the view and other processes because they deal with only data since last month (not data for years). – Henry Tsai Apr 28 '20 at 02:33

1 Answers1

0

Please provide a reference to Windows UNICODE -- we need to determine whether it is really "Unicode code points" or actually "UTF-8". If you can provide a hex dump of a little bit of text, I can deduce the answer from that.

More specifically, 王曉明張小英, encoded in UTF-8 (MySQL's utf8 or utf8mb4) is hex

E78E8B E69B89 E6988E E5BCB5 E5B08F E88BB1

(Spaces added to separate characters.) For Unicode (MySQL's UCS2):

738B 66C9 660E 5F35 5C0F 82F1

So, if you get the second hex, then you need to declare that the client is using ucs2, not utf8mb4. Meanwhile, it is quite reasonable for the columns in your tables to be utf8mb4. (And I recommend such.)

"Character set" versus "collation": utf8mb4 is a "character set"; it determines the "encoding" if the bytes. utf8mb4_0900_ai_ci is a "collation"; it determines the sorting order of characters. You have an encoding problem, not a sorting problem.

"several hours after modification" -- This reminds me of the derivation of the computer term "bug". Most of the original computers were built of vacuum tubes. Moths were attracted to the lights emanating for the tubes. They sometimes caused hardware problems.

Hex A4FDBEE5A9FA is the Big5 encoding for 王曉明.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • In windows, I typed 王曉明 in the "text Input" in the website https://www.online-toolz.com/tools/text-hex-convertor.php and the "Hex output" is e78e8be69b89e6988e. For testing, I recreated the VIEW with SQL syntax "where m.NAME <> '王曉明'", which worked perfectly. However, after dropping and recreating the table sincelastmonth, the string '王曉明' in the VIEW DDL became 'xA4xFDxBExE5xA9xFA'. At this point, when I try to edit the DDL, the Workbench showed a warning saying "Failed to create/alter VIEWNAME", "bad conversion". – Henry Tsai Apr 28 '20 at 02:48
  • Tradition Chinese Windows 10 comes with a built-in Chinese input method editor (IME) which allows you to choose from the two character sets of output, specifically BIG5 and UNICODE. The screenshot can be found online https://www.ddm.org.tw/ddm_input/images/win10_04.JPG – Henry Tsai Apr 28 '20 at 03:37
  • hex A4FDBEE5A9FA is the "Big5" encoding for that string. MySQL also understand that `CHARACTER SET`. – Rick James Apr 28 '20 at 04:01
  • The problem was related to the CMD in Traditional Chinese Windows which uses BIG5 encoding by default. Although I didn't put any Chinese character strings in the batch file, the DDL of my VIEWs got affected (a bug?). After I added a command line "chcp 65001" to the beginning of the batch file, forcing the CMD to use UTF-8, the problem disappeared. James, thank you for pointing out that A4FDBEE5A9FA is a Big5 encoding, which made me to take a closer look at the windows CMD, a strange program still using Big5 by default in 2020. – Henry Tsai Apr 28 '20 at 07:36