2

I have this MySQL table in production that is of charset latin1_swedish_ci ( aka latin1 ) .

Right now, there is this incoming content( String : "\ud55c\ubc24\uc758" ) in a UTF-8 format that needs to be inserted into this TEXT column field called keywords in the table.

When I try to perform the INSERT, I get this error :

Incorrect string value: '\xED\x95\x9C\xEB\xB0\xA4...' for column 'keywords' at row 1

I have tried all kinds of ways in my Java code to try to convert from UTF8 to ISO-8859-1 like this below and I am still getting the same error :

String convertedString = new String(originalString.getBytes("UTF-8"), "ISO-8859-1");

I know there are solutions on StackOverflow that mentions to change the charset of the MySQL table to UTF8 from latin1, and I unfortunately cannot do that because this is a live production MySQL master server and also it has historically been using latin1.

Does anyone have any suggestions to fix this "Incorrect string value" error?

Thanks IS

user1805458
  • 1,081
  • 3
  • 9
  • 21
  • You need to change the encoding _in Java_ from usc2 to utf8. `ISO-8859-1` is nowhere involved in what you show. You were hoping for `한밤`, correct? – Rick James May 06 '16 at 03:18
  • Hi Rick James, thanks for the fast reply. What do you mean by changing the encoding from usc2 to utf8? Sorry, I am not following. And yes, I would like to get to 한밤 – user1805458 May 06 '16 at 13:58
  • So there used to be an old Python script that would read this unicode text from another MySQL database(in utf-8) and write it into this latin1 MySQL database , and this text would show up as í•œë°¤ì˜ If I can figure out how to convert the "\ud55c\ubc24\uc758" into " í•œë°¤ì˜ " in my Java code then I am good – user1805458 May 06 '16 at 14:02

2 Answers2

0

What you're trying to do simply isn't possible, unless the characters in the utf8 string also happen to have representations in latin1... and latin1 is a tiny single-byte character set (fewer than 256 possible characters, total), so the vast majority of valid utf8 characters have no equivalent latin1 representation.

You can't store any character in the column that the character set of the column doesn't support. It's not a matter of "converting" from one to the other.

If you need unicode, you need at least a utf8 column, and modifying the table is the only alternative. Trying to do otherwise is like trying to store a negative number in an unsigned integer column. Unsigned ints can't be negative -- it's not a matter of conversion.

This would be true of any RDBMS that supports character data types, and is not a limitation specific to MySQL.

Michael - sqlbot
  • 169,571
  • 25
  • 353
  • 427
  • Hi Michael, thanks for reply. So there used to be an old Python script that would read this unicode text from another MySQL database(in utf-8) and write it into this latin1 MySQL database , and this text would show up as í•œë°¤ì˜ If I can figure out how to convert the "\ud55c\ubc24\uc758" into " í•œë°¤ì˜ " in my Java code then I am good – user1805458 May 06 '16 at 14:02
0

한밤 is the Mojibake for 한밤 -- that is where it got converted to latin1 at some stage. But \ud55c\ubc24 is Unicode. What mode is Python in? Do you have this at the beginning?

# -*- coding: utf-8 -*- 

More Python checklist.

More

utf8 is preferred; euckr is possible. But... The problem is not in picking the character set, it is in being consistent throughout the application in specifying that character set.

Are you using Python? It is tagged Java?

For Java/JDBC, you need ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.

You need these:

  • The bytes in your client need to be utf8, such as hex ED959C. (Korean characters are all 3 bytes in utf8.)
  • The connection between the client and the server needs to be utf8. Performing SET NAMES utf8 right after connecting is another way to do that.
  • The column/table needs to be CHARACTER SET utf8.
  • If you are using html, it will need <meta charset=UTF-8>.

For Korean, utf8mb4 is as good as utf8. Check those 4 bullet items above, and 'prove' to us that you are doing all of them.

For JSP and Java Servlets, slightly different advice is warranted.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Hi Rick, yes, I do see "# -*- coding: utf-8 -*-" at the top of the Python file – user1805458 May 07 '16 at 06:42
  • Anything useful in the checklist link? – Rick James May 07 '16 at 17:58
  • Unfortunately not yet, but I did try to use a different conversion in the Java code. When I try to convert to Cp1252 instead of ISO-8859-1 using "new String(keywords.getBytes("UTF-8"), "Cp1252") " , I noticed something interesting – user1805458 May 08 '16 at 09:19
  • In Java, when I convert the Unicode Korean characters to Cp1252, I get 한밤ì� which triggers the "Incorrect string value: '\xEF\xBF\xBD\xCB\x9CT...' for column 'keywords' at row 1" error. However, when it was written by Python into MySQL, it was 한밤ì , and that writes fine into the MySQL latin1 column! The only difference is that last character, and I am still trying to figure out how I can get to the same result for Java as it were in Python – user1805458 May 08 '16 at 09:23
  • CP1252 has 256 different characters. Korean has about 11,000. To put it bluntly, Korean cannot be represented in CP1252, so do _not_ use CP1252 for Korean text. – Rick James May 09 '16 at 05:22
  • MySQL's latin1 does not check anything; it will take the bytes you give it and store them. The _single_ character `한` is represented as _3 bytes_ in utf8 (hex ED959C). If those 3 bytes are stored in a latin1 column, they are treated as these 3 latin1 _characters_ `한`. Avoid latin1, too. – Rick James May 09 '16 at 05:26
  • I see. So if CP1252 and ISO-8859-1 (latin1) are both out, are there any other options for the Java conversion in "new String(originalString.getBytes("UTF-8"), "Some charset");" to still be able to convert 한 to 한 ? These are all the possible options for charset right: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html – user1805458 May 09 '16 at 18:46
  • I am using Java 8 using the Jersey REST framework. The Python code is what I am trying to replace because it is very old and inefficient. I have been using the "useUnicode=yes&characterEncoding=UTF-8" when I connect to my MySQL database, and it is still gives the "Incorrect String Value" error. The client that sends the HTTP POST to the Java server is POST-ing in utf-8. As for the actual MySQL column itself that is in latin1, we unfortunately cannot change it to utf-8 – user1805458 May 09 '16 at 23:28
  • You must find a way to change the column to utf8. If Jersey won't let you do it, then get rid of Jersey. – Rick James May 10 '16 at 00:43