
I can read from an MSSQL database by sending queries in Python through pypyodbc.

Unicode characters are mostly handled correctly, but I've hit one character that causes an error.

The field in question is of type nvarchar(50) and begins with this character "" which renders for me a bit like this...

-----
|100|
|111| 
-----

If that number is hex 0x100111, then it's the Supplementary Private Use Area-B character U+100111. Interestingly, if it's binary 0b100111 then it's an apostrophe. Could the wrong encoding have been used when the data was uploaded? This field stores part of a Chinese postal address.
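
A quick sanity check of that hunch in the interpreter (Python 2.7, to match the traceback below):

# hex 0x100111 falls in Supplementary Private Use Area-B (U+100000..U+10FFFD);
# binary 0b100111 is decimal 39, which is the ASCII apostrophe.
print(repr(unichr(0b100111)))            # u"'"
print(0x100000 <= 0x100111 <= 0x10FFFD)  # True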

The error message includes

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full...

Traceback (most recent call last):
  File "question.py", line 19, in <module>
    results.fetchone()
  File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
    value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))
  File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
    uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)
  File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
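
If I'm reading pypyodbc's UCS_dec right, it decodes the buffer two bytes at a time, but U+100111 is stored in UTF-16 as a four-byte surrogate pair, so the first two-byte slice is an unpaired high surrogate. This sketch reproduces the same error (my reading of the code, so treat the mechanism as an assumption):

pair = u'\U00100111'.encode('utf-16-le')
print(repr(pair))                      # '\xc0\xdb\x11\xdd', a surrogate pair
print(repr(pair.decode('utf-16-le')))  # the full four bytes decode fine
try:
    pair[0:2].decode('utf-16')  # a lone high surrogate, as UCS_dec would see it
except UnicodeDecodeError as e:
    print(e)  # can't decode bytes in position 0-1: unexpected end of data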

Here's some minimal reproducing code...

import pypyodbc

connection_string = (
    "DSN=sqlserverdatasource;"
    "UID=REDACTED;"
    "PWD=REDACTED;"
    "DATABASE=obi_load")

connection = pypyodbc.connect(connection_string)

cursor = connection.cursor()

query_sql = (
    "SELECT address_line_1 "
    "FROM address "
    "WHERE address_id == 'REDACTED' ")

with cursor.execute(query_sql) as results:
    row = results.fetchone() # This is the line that raises the error.
    print row

Here is a chunk of my /etc/freetds/freetds.conf

[global]
;   tds version = 4.2
;   dump file = /tmp/freetds.log
;   debug flags = 0xffff
;   timeout = 10
;   connect timeout = 10
    text size = 64512

[sqlserver]
host = REDACTED
port = 1433
tds version = 7.0
client charset = UTF-8

I've also tried with client charset = UTF-16 and omitting that line altogether.

Here's the relevant chunk from my /etc/odbc.ini

[sqlserverdatasource]
Driver = FreeTDS
Description = ODBC connection via FreeTDS
Trace = No
Servername = sqlserver
Database = REDACTED

Here's the relevant chunk from my /etc/odbcinst.ini

[FreeTDS]
Description = TDS Driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 1

I can work around this issue by fetching results in a try/except block and throwing away any rows that raise a UnicodeDecodeError, but is there a proper solution? Can I throw away just the undecodable character, or is there a way to fetch this row without raising an error?
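
For reference, the workaround looks roughly like this (a sketch; I'm assuming the cursor advances past the bad row even when conversion raises, which matches what I see but isn't something I've found documented):

results = cursor.execute(query_sql)
rows = []
while True:
    try:
        row = results.fetchone()
    except UnicodeDecodeError:
        continue  # throw away the whole undecodable row
    if row is None:
        break
    rows.append(row)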

It's not inconceivable that some bad data has ended up on the database.

I've Googled around and checked this site's related questions, but have had no luck.

vowel-house-might
    Have you seen http://stackoverflow.com/questions/18357675/unicodedecodeerror-unexpected-end-of-data? – snakecharmerb Apr 18 '16 at 12:42
    What do you get if you `SELECT master.sys.fn_varbintohexstr(CONVERT(VARBINARY, [address_line_1])) AS foo FROM [address] ...` ? That should show you exactly what is at the beginning of the text value. – Gord Thompson Apr 18 '16 at 12:46
  • @GordThompson 0x4700520045004e00410044004a00c400520047004100540041004e002000 – vowel-house-might Apr 18 '16 at 15:52
  • Interesting. Those are definitely valid NVARCHAR bytes (UTF-16LE) but they represent 'GRENADJÄRGATAN ' which doesn't look particularly Chinese to me. Still, that's what pyodbc should return. – Gord Thompson Apr 18 '16 at 16:22
  • @GordThompson My apologies, that was from the previous line. – vowel-house-might Apr 18 '16 at 17:20
  • @GordThompson 0xc0db11dd53004300490045004e0043004500260054004500430048004e00 – vowel-house-might Apr 18 '16 at 17:20
  • You're right, that first character (four bytes) is U+100111. Unfortunately, I am unable to recreate your issue under Python 2.7.11 with pyodbc 3.0.7. When I `print(row)` I get `(u'\U00100111SCIENCE&TECHN', )`. However, I'm using Windows and the "SQL Server Native Client 10.0" ODBC driver. Still, it does suggest that it's not a problem with Python or pyodbc *per se*. Are you using reasonably current versions of Python 2.7, pyodbc, and FreeTDS? – Gord Thompson Apr 18 '16 at 21:29
  • ...also, just tried with pypyodbc 1.3.3 and that worked okay, too. – Gord Thompson Apr 18 '16 at 22:35

2 Answers


I fixed the issue myself by using this:

conn.setencoding('utf-8')

immediately before creating a cursor.

Where conn is the connection object.
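
In context, it looks like this (a sketch: setencoding is pyodbc's API rather than pypyodbc's, and the single-argument form shown is the Python 3 signature; on Python 2 the first argument is a str or unicode type):

import pyodbc  # note: setencoding is pyodbc's API, not pypyodbc's

conn = pyodbc.connect("DSN=sqlserverdatasource;UID=REDACTED;PWD=REDACTED;DATABASE=REDACTED")
conn.setencoding('utf-8')  # must happen before conn.cursor()
cursor = conn.cursor()     # rows fetched from here on use the new encoding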

I was fetching tens of millions of rows with fetchall() in the middle of a transaction that would have been extremely expensive to undo manually, so I couldn't afford to simply skip invalid rows.

Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456

Max Candocia
  • Interesting, but MSSQL uses UTF-16LE, which pyodbc also uses by default, so I wouldn't expect specifying UTF-8 to help under normal circumstances. Still, if it seems to have helped you then it might help others who really do have UTF-8 stuffed into an MSSQL database. – Gord Thompson Feb 05 '18 at 19:34
  • I am using Azure SQL, which may have a few quirks that make it different? Either way, I find it bizarre, since the only way data gets into that database is through the same codebase/`pyodbc` driver that I was using to fetch the data. – Max Candocia Feb 05 '18 at 19:43

This problem was eventually worked around. I suspect that text in one encoding was hammered into a field with a different declared encoding through some hacky method when the table was set up.
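
For anyone hitting the same thing, the raw-byte dump Gord Thompson suggested in the comments is a good way to spot this kind of bad data (a sketch reusing the cursor from the question; the WHERE clause value is a placeholder):

# fn_varbintohexstr returns a plain hex string, so fetching it
# can't trip the same UnicodeDecodeError as the NVARCHAR column.
check_sql = (
    "SELECT master.sys.fn_varbintohexstr("
    "CONVERT(VARBINARY, address_line_1)) AS raw_bytes "
    "FROM address "
    "WHERE address_id = 'REDACTED'")
cursor.execute(check_sql)
print cursor.fetchone()[0]  # e.g. 0xc0db11dd530043...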

vowel-house-might