String encoding issue with pyodbc's returned data

Question

On making query with PyODBC, I am getting my table data as:

 u'\u3836\u3431\u3132\u3230\u3030'

The actual content in my database column is as:

6814210200

When I explicitly encode pyodbc's returned value to utf-16, I get the content as (closest I went):

>>> print d['data'][0]['upc'].encode('utf-16')
 ��6814210200
#^^ two junks

My question is: How can I get the encoded value directly from the PyODBC query?

I already have CHARSET=UTF16 set in my database connection string as:

pyodbc.connect("DRIVER=<driver_name>;" + \
                                 "SERVER=<server_ip>;" +\
                                 "DATABASE=<database>;" +\
                                 "UID=<user>;" +\
                                 "PWD=<password>;" + \
                                 "CHARSET=UTF16",    # setting charset
                                 ansi=True)

Also in all my odbc.ini and odbcinst.ini file, I have set:

 UnicodeTranslationOption = utf16 
 CharacterTranslationOption = all

under my driver's setting.

Is there some special reason that you're using UTF-16? That's not very common, and almost unheard of on Linux systems. What happens if you set `"CHARSET=UTF8"`? — PM 2Ring, Nov 07 '16 at 07:04
I don't know if it's relevant to your particular situation (and I know nothing about PyODBC), but there's a lot of good info about Unicode (both for Python 2 & Python 3) in this article: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. — PM 2Ring, Nov 07 '16 at 07:06
@PM2Ring `"CHARSET=UTF8"` also returns the same result. The reason for using `UTF-16` is I am using Nettezza's Database driver which is based on UTF-16 codec. And thanks for sharing the article, it looks interesting to me. Will take a look. — Moinuddin Quadri, Nov 07 '16 at 07:08
No need to use backslashes there! No need to use + even if strings are constant. Your data doesn't seem to be in UTF-16 - what do you *actually* get when you use `CHARSET=UTF8` - please add the **unmodified `repr` of that value**. — Antti Haapala -- Слава Україні, Nov 07 '16 at 07:19
@AnttiHaapala: I am using `.format()` in each line in actual code, that is the reason to use "+ \". The O/P I have given is the content of `repr()` as `u'\u3836\u3431\u3132\u3230\u3030'`, Without `repr`, it prints `㠶㐱ㄲ㈰〰` (something in Chineese (looks like mandrin)) . And, I get the same content if I mention `CHARSET=UTF8` with the connection string — Moinuddin Quadri, Nov 07 '16 at 08:24

score 2 · Accepted Answer · answered Nov 07 '16 at 06:49

2

You need to specify that you want the little endian version of UTF-16.

s = u'\u3836\u3431\u3132\u3230\u3030'
print s.encode('utf-16le')

output

6814210200

FWIW, in Python 3, s.encode('utf-16le') returns b'6814210200'.

answered Nov 07 '16 at 06:49

PM 2Ring

54,345
6
82
182

I do not want to explicitly encode value of each row and column returned by PyODBC query. I am looking for option to specify encoding in the PyODBC configuration/connection string, which implicitly handles the encoding part. Doing it explicitly will be huge over head. Anyway, +1 as PyODBC query part I have added later in the question. – Moinuddin Quadri Nov 07 '16 at 06:58
3

In that case you may well need to get things sorted out at the database level. – holdenweb Nov 07 '16 at 08:29

String encoding issue with pyodbc's returned data

1 Answers1