0

I'm trying to work with a Hebrew database, but the output is gibberish. What am I doing wrong?

# -*- coding: utf-8 -*-
import pypyodbc 
conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\\client.mdb')
cur = conn.cursor()
cur.execute('''SELECT * FROM Client''')
d = cur.fetchone()
for field in d:
    print field

If I look at cur.fetchone():

'\xf0\xf1\xe0\xf8', '\xe0\xe9\xe0\xe3'

Output:

αΘαπ
2001
εδßΘ
αΘ°σ
RoyEsh
  • I'm not too sure about Unicode encodings, but it looks like it might have been encoded in something other than UTF-8, or there's some kind of offset between the fields and the Unicode strings. `\xf0` is the start of a 4-byte UTF-8 sequence, but Hebrew characters should all be 2-byte, with a first byte of the form `110xxxxx`. – Kyle_S-C Mar 14 '15 at 00:10
  • Might it be in [Windows 1255 encoding](https://msdn.microsoft.com/en-gb/goglobal/cc305148)? – Kyle_S-C Mar 14 '15 at 00:14

2 Answers

2

If either of נסאר or איאד is meaningful, then try:

field.decode('cp1255')

Google Translate suggests this might correspond to a person named Iyad Nassar.
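As a minimal sketch of the suggestion above, the two byte strings quoted in the question can be decoded as Windows-1255 (`cp1255`). The helper name `decode_hebrew` is hypothetical, and the `cp437` line is only an assumption about why the console showed Greek-looking characters: the same bytes read as code page 437 happen to reproduce the first line of the question's output.

```python
# -*- coding: utf-8 -*-
# Raw byte strings copied from the cur.fetchone() output in the question.
raw_fields = [b'\xf0\xf1\xe0\xf8', b'\xe0\xe9\xe0\xe3']


def decode_hebrew(value, encoding='cp1255'):
    """Hypothetical helper: decode byte strings, pass other types through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # numbers, None, or already-decoded text


# Decoding as Windows-1255 gives the Hebrew names from the answer.
decoded = [decode_hebrew(f) for f in raw_fields]

# Assumption: the console rendered the raw bytes as code page 437,
# which would explain the gibberish in the question's output.
as_cp437 = raw_fields[1].decode('cp437')
```

In the question's loop, this would mean calling `decode_hebrew(field)` on each field before printing it, so that non-text columns such as `2001` are left alone.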

Kyle_S-C
  • It does. I really don't understand why it works on your machine, but on mine I get "UnicodeEncodeError: 'ascii' codec can't encode characters in position..." – RoyEsh Mar 14 '15 at 00:24
  • I'm using PyCharm IDE to represent things for me, with `# coding: utf-8` at the top, like you. It's definitely encoded in the Windows 1255 encoding then. It's a bit of a pain, but each hex number corresponds to a single Hebrew character or vowel mark. I also have Hebrew installed as a language on Windows so that I could help my partner with a Hebrew language corpus study. – Kyle_S-C Mar 14 '15 at 00:28
  • Perhaps this might help: https://pythonhosted.org/kitchen/unicode-frustrations.html – Kyle_S-C Mar 14 '15 at 00:42
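The `UnicodeEncodeError` mentioned in the comments is what Python 2 typically raises when a decoded `unicode` value is written to an output stream that falls back to the `ascii` codec. One possible workaround, sketched here under that assumption, is to encode explicitly before writing. The function name `to_console_bytes` and the UTF-8 fallback are illustrative choices, not something taken from the thread.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Hypothetical helper: encode text explicitly instead of relying on
    the implicit (possibly ASCII) encoding of the output stream."""
    # Prefer the caller's choice, then the stream's own encoding,
    # and fall back to UTF-8 (an assumption) if neither is available.
    enc = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the target encoding lacks,
    # so the call does not raise UnicodeEncodeError.
    return text.encode(enc, 'replace')
```

Whether the resulting bytes display as Hebrew still depends on the terminal or IDE console being set to the same encoding.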
0

Try using:

field.encode('utf-8')
Roy Shmuli