0

I have a MySQL dump, which is not valid UTF-8. Two questions:

  1. Could this be caused by some of the database using utf8mb3 aka MySQL's 'utf8'? It definitely does use this encoding.
  2. If so, how can I fix it, without having access to MySQL to import, change table types, and re-export? Can I use any encoding conversion tool?

Edit to add specific data which isn't valid UTF-8:

uconv -f utf8 a.sql -o /dev/null

Conversion to Unicode from codepage failed at input byte position XXX. Bytes: ed Error: Illegal character found

Here is a hex sample.

xxd -s {XXX-16} -l 30 a.sql
YYY: 6e2e 203c 2f70 3e20 cfa1 ecaf a6eb 9ea0  n. </p> ........    
     aabb aabb aabb aabb aaaa bbbb bbaa aaaa
YYZ: edb6 b0e1 aea5 ee9e a027 2c27 3230       .........','20
     ^^^^ ^^

Edit 2: Added more context above. Also looks like the problem sequence meets UTF-8 format, it just maps to U+1DDB0 which doesn't exist.

Zachary Vance
  • 752
  • 4
  • 18
  • Unclear. Let's see the HEX for a small portion of the "invalid dump". `utf8` is a subset of `utf8mb4`, so utf8mb3 will not cause trouble with either. `SHOW VARIABLES LIKE 'char%'`. – Rick James Oct 13 '20 at 05:54
  • 1
    "which is not valid UTF-8" - please provide an example how you come to this conclusion, so we can exclude a potential wrong one. – AmigoJack Oct 13 '20 at 09:39
  • Example added. @RickJames can you give a citation/link for your claim? I understand "utf8" characters are a subset of "utf8mb4" characters, but I don't have a spec for "utf8mb3" to verify that one encoding is a subset of the other encoding. – Zachary Vance Oct 14 '20 at 00:42
  • [utf8mb3](https://en.wikipedia.org/wiki/UTF-8#MySQL_utf8mb3). Can you add at least 4 bytes in front of those? `a6` rather qualifies for the 2nd byte, so the previous ones are interesting, too. Maybe [CESU-8](https://en.wikipedia.org/wiki/CESU-8) applies. – AmigoJack Oct 15 '20 at 10:48
  • I did add leading bytes--"a6" is not the offending byte :) – Zachary Vance Oct 16 '20 at 23:22
  • OK added a little more context and did the lookup for you – Zachary Vance Oct 17 '20 at 00:50
  • @AmigoJack CESU-8 seems to always start "eda" rather than "edb" based on reading that. – Zachary Vance Oct 17 '20 at 00:57

2 Answers2

0

Your first line YYY consists, as per UTF-8, of the following characters (all valid sequences):

U+006E  n  6e        LATIN SMALL LETTER N
U+002E  .  2e        FULL STOP
U+0020     20        SPACE
U+003C  <  3c        LESS-THAN SIGN
U+002F  /  2f        SOLIDUS
U+0070  p  70        LATIN SMALL LETTER P
U+003E  >  3e        GREATER-THAN SIGN
U+0020     20        SPACE
U+03E1  ϡ  cf a1     GREEK SMALL LETTER SAMPI
U+CBE6  쯦  ec af a6  (Hangul)
U+B7A0  랠  eb 9e a0  (Hangul)

Your second line YYZ of the following: starting with a sequence that is technically correct, but logically illegal, because surrogates are disallowed to be used, especially when unpaired (no corresponding high surrogate). The "private use" sequence/character is allowed, but suspicious:

U+DDB0  �  ed b6 b0  (low surrogate = illegal, reserved for UTF-16)
U+1BA5  ᮥ  e1 ae a5  SUNDANESE VOWEL SIGN PANYUKU
U+E7A0    ee 9e a0  (private use = highly unlikely it is used intentionally)
U+0027  '  27        APOSTROPHE
U+002C  ,  2c        COMMA
U+0027  '  27        APOSTROPHE
U+0032  2  32        DIGIT TWO
U+0030  0  30        DIGIT ZERO

Logically the characters make no sense either (one Greek, two Korean...). It also doesn't fit any of the ANSI encodings:

  • a6 almost always translates to ¦ and I haven't encountered this character in years
  • af almost always translates to ¯ which is mostly used for optical reasons, but even then rarily only once
  • 9e is rarily used at all, only few encodings map it
  • a0 is everywhere the non breaking space, which nobody is able to type directly, but might come from copying texts from i.e. MS Word

From the SQL context you can see all those characters are still within a String literal (see apostrophes), and its content seems to be HTML (see angle brackets). As those 17 bytes cannot store much information anyway just use a hex editor and overwrite every byte with 20 (space).

AmigoJack
  • 5,234
  • 1
  • 15
  • 31
-1

I "solved" this using:

uconv --from-code utf8 --from-callback substitute --to-code utf8 a.sql -o a.sql.fixed

Which simply substitutes a default character for any invalid sequences. man unconv shows several other options, like escaping or deleting invalid sequences.

In my case there only appeared to be a very small number of errors, so I was more interested in being able to process the dump, than correctly identifying what had happened in these cases.

Edit: By using --from-callback skip I could count the number of invalid sequences, by comparing the length of the input and output files.

Zachary Vance
  • 752
  • 4
  • 18
  • Your question sounded like you wanted to recover, instead of cutting. It's unfair to accept this answer, as it solves a different problem. – AmigoJack Nov 05 '20 at 21:53
  • This answer is not great, I agree. I appreciate your effort, but it didn't answer the questions better. This uses a tool instead of manual hex editing, so I decided it was better. Also, this question is 1 month old, and I feel at some point I should just mark something as solved. – Zachary Vance Nov 05 '20 at 22:04
  • "how can I fix it" is bound to context, and there was no hint that cutting/deleting improper sequences was an option. Maybe in your mind, but I'm sitting here on the outside. Ask someone else to read your question, then discuss possible definitions of "fix it". – AmigoJack Nov 05 '20 at 22:19