0

Running tried this with both Python3.7 and Python3.8, with mysql-connector-python 8.0.13 and 8.1.0

MySQL 5.7.42

Collation on the database is set to 'utf8mb4_unicode_520_ci'

Connection from Python is:

db =  None
db = mysql.connector.connect(
    host="localhost",
    user=username,
    passwd=password,
    database=eventdb,
    charset="utf8mb4",
    use_unicode=True
)

cur = None
cur = db.cursor(dictionary=True)

I have a string that comes from a json.dump and attempting to run a parameterized query with it:

data["name"] = '\udced\udca0\udcbe\udced\udcb7\udca1\n\n\udced\udca0\udcbe\udced\udcb7\udca1\n\n♡ADANA♡♡EOMON♡'

sql = "SELECT db_name_id FROM db_name WHERE name = %s"
val = (data["name"],)
curr.execute(sql_text, sql_val)

mysql-connector-python 8.0.13 on both version of Python returns UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-5: surrogates not allowed

mysql-connector-python 8.1.0 on Python.38 returns _mysql_connector.MySQLInterfaceError: Failed converting Python 'str'

However if I execute a simple query:

cur.execute(SELECT db_name_id FROM db_name WHERE name = '\udced\udca0\udcbe\udced\udcb7\udca1\n\n\udced\udca0\udcbe\udced\udcb7\udca1\n\n♡ADANA♡♡EOMON♡')

Then it executes without error, this is a user entered field though and I really DON'T want to be doing the query without parameters.

The simplest example that replicates the exception error I'm seeing is using the C Extension directly:

import _mysql_connector

ccnx = _mysql_connector.MySQL()
ccnx.connect(
                host="localhost",
                user="user",
                password="password",
                database="database"
            )

bad_str = 'just_an_��_example'

try:
    str_converted = ccnx.convert_to_mysql(*[bad_str])
    print('str converted is %s', str_converted)
except Exception as e:
    print('cant convert bad str %s',bad_str)
    print(e)

I've only tested this with mysql-connector-python 8.1.0.

If I make the following change based on information MySQL Bug 99757, then the convert_to_mysql works:

import _mysql_connector

ccnx = _mysql_connector.MySQL()
ccnx.connect(
                host="localhost",
                user="user",
                password="password",
                database="database"
            )
ccnx.set_character_set('utf8')
bad_str = 'just_an_��_example'

try:
    str_converted = ccnx.convert_to_mysql(*[bad_str])
    print('str converted is %s', str_converted)
except Exception as e:
    print('cant convert bad str %s',bad_str)
    print(e)

It seems like the conversion to a mysql string is broken for some cases, including parameterized strings with surrogates. I'm hoping there's just something I missed.

  • Those `DCxx` codes look like Syriac letters ( ܀܁܂܃܄܅܆܇܈܉܊܋܌܍܏ܐܑܒܓܔܕܖܗܘܙܚܛܜܝܞܟܠܡܢܣܤܥܦܧܨܩܪܫܬܭܮܯܱܴܷܸܹܻܼܾܰܲܳܵܶܺܽܿ ) -- Is that what you are expecting? I do not see a problem with MySQL or Unicode, perhaps with Python? I see DCA0 1824=x0720 [ܠ] AL SYRIAC LETTER LAMADH DCA1 1825=x0721 [ܡ] AL SYRIAC LETTER MIM DCB7 1847=x0737 [ܷ] NSM SYRIAC RBASA BELOW DCBE 1854=x073E [ܾ] NSM SYRIAC ESASA BELOW – Rick James Jul 29 '23 at 16:46
  • What is `convert_to_mysql` supposed to do? That is, why do you need to do any type of conversion? – Rick James Jul 29 '23 at 16:47
  • Those `\uDCxx` codes are all low surrogates which gives no sense. Please [edit] your question to improve your [mcve]. In particular, share where that weird string comes from. Please share a code snippet as well as data sample (sanitized if necessary). – JosefZ Jul 29 '23 at 20:48
  • `convert_to_mysql` I believe is a function that mysql-connector uses to encode strings for entry into mysql. I'm not actually sure, it just happens to be a minimal code example that I believe shows there's a potential bug in mysql-connector and how it handles surrogate pairs. – TechnoSwiss Jul 29 '23 at 21:52
  • These are player names entered by users for an online MMORPG game, I'm parsing data from the game and have no control over what the users are entering, or how the game is sanitizing the data. So I'm not sure what those characters are actually supposed to be. – TechnoSwiss Jul 29 '23 at 22:00
  • @JosefZ the realization that these are invalid surrogate pairs, I'll have to look more into the data I'm getting from the game to figure out why those are coming out as invalid. However in the meantime I'm catching the exception and then using `data["name"] = data["name"].encode('utf-8', 'replace').decode('utf-8')` to replace them and that solves the problem. – TechnoSwiss Jul 30 '23 at 05:36

0 Answers0