-1

On making query with PyODBC, I am getting my table data as:

 u'\u3836\u3431\u3132\u3230\u3030'

The actual content in my database column is as:

 6814210200

When I explicitly encode pyodbc's returned value to utf-16, I get the content as (closest I went):

>>> print d['data'][0]['upc'].encode('utf-16')
 ��6814210200
#^^ two junks

My question is: How can I get the encoded value directly from the PyODBC query?

I already have CHARSET=UTF16 set in my database connection string as:

pyodbc.connect("DRIVER=<driver_name>;" + \
                                 "SERVER=<server_ip>;" +\
                                 "DATABASE=<database>;" +\
                                 "UID=<user>;" +\
                                 "PWD=<password>;" + \
                                 "CHARSET=UTF16",    # setting charset
                                 ansi=True)

Also in all my odbc.ini and odbcinst.ini file, I have set:

 UnicodeTranslationOption = utf16 
 CharacterTranslationOption = all

under my driver's setting.

Moinuddin Quadri
  • 46,825
  • 13
  • 96
  • 126
  • Is there some special reason that you're using UTF-16? That's not very common, and almost unheard of on Linux systems. What happens if you set `"CHARSET=UTF8"`? – PM 2Ring Nov 07 '16 at 07:04
  • I don't know if it's relevant to your particular situation (and I know nothing about PyODBC), but there's a lot of good info about Unicode (both for Python 2 & Python 3) in this article: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Nov 07 '16 at 07:06
  • @PM2Ring `"CHARSET=UTF8"` also returns the same result. The reason for using `UTF-16` is I am using Nettezza's Database driver which is based on UTF-16 codec. And thanks for sharing the article, it looks interesting to me. Will take a look. – Moinuddin Quadri Nov 07 '16 at 07:08
  • No need to use backslashes there! No need to use + even if strings are constant. Your data doesn't seem to be in UTF-16 - what do you *actually* get when you use `CHARSET=UTF8` - please add the **unmodified `repr` of that value**. – Antti Haapala -- Слава Україні Nov 07 '16 at 07:19
  • @AnttiHaapala: I am using `.format()` in each line in actual code, that is the reason to use "+ \". The O/P I have given is the content of `repr()` as `u'\u3836\u3431\u3132\u3230\u3030'`, Without `repr`, it prints `㠶㐱ㄲ㈰〰` (something in Chineese (looks like mandrin)) . And, I get the same content if I mention `CHARSET=UTF8` with the connection string – Moinuddin Quadri Nov 07 '16 at 08:24

1 Answers1

2

You need to specify that you want the little endian version of UTF-16.

s = u'\u3836\u3431\u3132\u3230\u3030'
print s.encode('utf-16le')

output

6814210200

FWIW, in Python 3, s.encode('utf-16le') returns b'6814210200'.

PM 2Ring
  • 54,345
  • 6
  • 82
  • 182
  • I do not want to explicitly encode value of each row and column returned by PyODBC query. I am looking for option to specify encoding in the PyODBC configuration/connection string, which implicitly handles the encoding part. Doing it explicitly will be huge over head. Anyway, +1 as PyODBC query part I have added later in the question. – Moinuddin Quadri Nov 07 '16 at 06:58
  • 3
    In that case you may well need to get things sorted out at the database level. – holdenweb Nov 07 '16 at 08:29