
I have a SQL query that I execute like this with an SQLAlchemy engine:

result = engine.execute('SELECT utf_8_field FROM table')

The database is MySQL and the column type is TEXT with UTF-8 encoding. The type of the returned utf_8_field is "str", even when I set the option convert_unicode=True when creating the engine. If my string contains a character like 'é' (which is not in 7-bit ASCII, but is in the extended ASCII set), I get a UnicodeDecodeError when trying to execute this:

utf_8_field.encode("utf-8")

The exact error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)

When looking into this, I found that str.encode does not support the extended ASCII character set! I find this really strange, but that's another question.
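For reference, here is a minimal snippet (Python 2, with a hard-coded non-ASCII byte standing in for my database value) that reproduces the same error outside of SQLAlchemy:

>>> s = '\xe9'  # a bytestring containing the same non-ASCII byte
>>> s.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)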

What I don't understand is why SQLAlchemy is not giving me a unicode string. I was previously using DB-API and that was working fine. I also don't have SQLAlchemy table objects for my tables yet, which is why I'm using an execute command.

Any idea?

Faelenor
  • Just making a note for those using Python3 - both `unicode` & `str` types in Python2 are not considered of type `str` ... – Ricky Levi Oct 11 '20 at 23:55

2 Answers


If you want the data converted automatically, you should specify the charset when you create the engine:

create_engine('mysql+mysqldb:///mydb?charset=utf8')

Setting use_unicode alone won't tell SQLAlchemy which charset to use.
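As a rough sketch of the whole flow (assuming the MySQLdb driver, a local database named mydb, and a hypothetical table my_table standing in for the one in the question):

# -*- coding: utf-8 -*-
from sqlalchemy import create_engine

# charset=utf8 makes MySQLdb transfer data as UTF-8 and also implies
# use_unicode, so TEXT columns come back as unicode objects.
engine = create_engine('mysql+mysqldb:///mydb?charset=utf8')

result = engine.execute('SELECT utf_8_field FROM my_table')
for row in result:
    print type(row[0])  # <type 'unicode'> instead of <type 'str'>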

mata
  • Thanks, it now converts my utf-8 field to unicode python strings. – Faelenor May 30 '12 at 16:01
  • According to the error message the first non-ASCII byte in your "utf-8 field" is `0xe9`. This indicates that it is **NOT** encoded in UTF-8 ... much more likely to be `cp1252`. – John Machin May 30 '12 at 22:03
  • @JohnMachin - yes, because if you don't specify a charset when creating the engine, mysqldb seems to default to latin1. Setting the charset tells both the db and the client to use the correct transfer encoding and therefore fixes the problem. – mata May 30 '12 at 22:09
  • WARNING: depending on the version of SQLAlchemy in use (0.8, 0.9, 1.0) and the database (MySQL, PostgreSQL, SQLite) the behavior and the recommendations are different! – Federico Apr 22 '15 at 16:46

To convert from a UTF-8 bytestring to a unicode object, you need to decode:

utf_8_field.decode('utf8')

Also, when executing a raw SELECT through .execute, SQLAlchemy has no metadata to work out that your query returns UTF-8 data, so it does not convert the results to unicode for you.

In other words, convert_unicode only works if you use the SQLAlchemy SQL expression API or the ORM functionality.
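A sketch of that route (the table definition here is hypothetical, mirroring the query from the question, and assumes the classic select() API of this SQLAlchemy era):

from sqlalchemy import MetaData, Table, Column, Text, create_engine, select

engine = create_engine('mysql+mysqldb:///mydb?charset=utf8', convert_unicode=True)
metadata = MetaData()

# Because utf_8_field is declared as Text, SQLAlchemy now has the
# metadata to know the column holds character data and can apply
# convert_unicode when fetching results.
my_table = Table('my_table', metadata, Column('utf_8_field', Text))

result = engine.execute(select([my_table.c.utf_8_field]))
for row in result:
    assert isinstance(row[0], unicode)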

EDIT: As pointed out, your data is not even UTF-8 encoded; a lead byte of 0xe9 in UTF-8 would indicate a character between \u9000 and \u9fff, which are CJK unified ideographs, while you said it was a latin-1 character, whose UTF-8 encoding would start with 0xc3. Your data is probably ISO-8859-1 (latin-1) or similar instead:

>>> u'é'.encode('ISO-8859-1')
'\xe9'
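For comparison, the UTF-8 encoding of the same character starts with 0xc3:

>>> u'é'.encode('utf-8')
'\xc3\xa9'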

The conclusion then is to tell SQLAlchemy to connect with a different character set, using the charset=utf8 parameter, as pointed out by @mata.

Martijn Pieters
  • The encode('utf-8') line was there before, when using DB-API. I guess it was because the database returned the string converted to unicode and we wanted it back to utf-8. – Faelenor May 30 '12 at 15:44
  • -1 He doesn't have UTF-8-encoded data. Read his error message. – John Machin May 30 '12 at 22:04
  • @JohnMachin: His error message states that the default codec used to first decode his bytestring (`ascii`) cannot deal with UTF-8. That's because encoding a bytestring first requires it to be decoded to unicode, which is what failed. – Martijn Pieters May 30 '12 at 23:00
  • @MartijnPieters: It states no such thing. It states that 0xe9 is not ASCII. It fails on the first non-ASCII byte. 0xe9 is a valid lead byte in UTF-8 -- for a CJK character. He mentions the character é in the "extended ASCII set" i.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE. This indicates that his data is encoded in Latin-n or cp125n, not UTF-8. – John Machin May 31 '12 at 00:11
  • Right! And the fact that `?charset=utf8` solved it for the OP is another piece of evidence that the database was probably configured to default to a different character set. Still, calling `encode` on a `str` is a common mistake that leads to this class of errors. – Martijn Pieters May 31 '12 at 07:23