2

I have a migration that is running some custom code that depends on unicode characters. I am currently using SQLAlchemy 1.1.9 and Alembic 1.0.2.

I can see my database and table have all the right settings:

mysql> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_general_ci   |
+--------------------------+----------------------+

and

mysql> SHOW TABLE STATUS where name like 'mytable';
+---------+-----+--------------------+----------+----------------+---------+
| Name    | ... | Collation          | Checksum | Create_options | Comment |
+---------+-----+--------------------+----------+----------------+---------+
| mytable | ... | utf8mb4_unicode_ci |     NULL |                |         |
+---------+-----+--------------------+----------+----------------+---------+

I have inserted a string, Nguyễn Johñ (note that the e and n are both unicode characters). When I have my flask application load the row, it properly loads. But when I run the migration, I see alembic debug logs showing Nguy?n Johñ and my own debug logs printing the same thing.

Why are some unicode characters converted to a question mark? (Note testing other characters, I see some characters in the terminal, some escaped, such as "\xa0", and others as "?".

The following might be significant too.

  • The URL sent to engine = create_engine() has the utf8 charset
  • I have the following code for running the migration:
from sqlalchemy.sql import table, column
from sqlalchemy import String, Integer, Boolean, Date, Unicode
MyTable = table('mytable',
    column('id', Integer),
    column('test1', Unicode(collation='utf8mb4_unicode_ci')),
    column('test2', Unicode),
)

...
def upgrade():
   ...
   bind = op.get_bind()
   session = orm.Session(bind=bind)
   rows = session.query(MyTable).all()
   print(rows)
  • The debug logs also show the following, but I am not sure if this is just alembic's own feature detection code:
INFO  [sqlalchemy.engine.base.Engine] show collation where `Charset` = 'utf8' and `Collation` = 'utf8_bin'
INFO  [sqlalchemy.engine.base.Engine] ()
DEBUG [sqlalchemy.engine.base.Engine] Col ('Collation', 'Charset', 'Id', 'Default', 'Compiled', 'Sortlen')
DEBUG [sqlalchemy.engine.base.Engine] Row ('utf8_bin', 'utf8', 83, '', 'Yes', 1)
Charles L.
  • 5,795
  • 10
  • 40
  • 60
  • 1
    It's likely that your strings are working just fine, but your console font doesn't include those glyphs, and your log file is not in UTF8 format. – Tim Roberts Jan 19 '23 at 20:13
  • Thanks for the suggestion, I was ran a select on the specific row and it output the correct content. It didn't even escape any unicode characters, which is interesting. – Charles L. Jan 19 '23 at 22:17

1 Answers1

2

Found the issue, it was a small, but very important, oversight.

I had engine = create_engine() taking in the DB URL w/ "charset=utf8", however, config.set_main_option('sqlalchemy.url', db_url) was not getting the charset.

Once the charset parameters was pased in both places, everything started working.

I can also confirm that Unicode(collation='utf8mb4_unicode_ci') is not necessary, using just Unicode() is good enough.

And DEBUG log lines still say "utf8_bin" and "utf8" so those I assume are doing some sort of feature detection.

I am only using sqlite and mysql, so I used this snippet:

if not db_url.startswith("sqlite"):
    sep = "&" if "?" in db_url else "?"
    db_url = "{db_url}{sep}charset=utf8".format(db_url=db_url, sep=sep)

I know this works for my env, but can be adapted for others. Others may not need to prevent adding the charset for sqlite, and may not encryption parameters sometimes added to the DB URL.

Charles L.
  • 5,795
  • 10
  • 40
  • 60