0

I'm building an application that works with the db of another application, using mssql 2005 (I can't change that or change the existing table definition). w the mssql table collate is: "hebrew bin", and the application shows perfect hebrew from the table, all the py files are encoded utf-8

notice! there is no problem writing with unicode hebrew strings to the db with mssql. there is no trouble choosing and deleting: DBSession2.query(object).filter(object.LOADED=='Y').delete() but when selecting from the table I get this very annoying error:

  File "D:\Python27\learn\agent\agent\lib\encodings\utf_8.py", line 16, in decode     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: invalid continuation byte

where the exact byte code changes according to the first byte of the first row in the table.

yes, I know that this translates to an hebrew letter - this shouldn't be a problem since all sides play with unicode - at least that is what I thought.

btw - this worked fine on the test mssql 2005 server but doesn't work with the production server.

some code: this is the failing part of the function:

def iterateJson(parser,injson,object):
    '''iterateJson(parser,injson,object):getting a parser method an a json and iterating over the json
    with the parser method, checkes for existing objects in the db table and deletes them before commiting the new one to
    prevent integerityerrors
    writes ValidateJsonError to errorlog for each element in the json
    getting an onject name to check for loaded etc'''

    #first lets erase the table from loaded objects
    DBSession2.query(object).filter(object.LOADED=='Y').delete()
    print "finished deleting loaded"
    #now lets get a list from the table of loaded id
    raw_list = DBSession2.query(object).all() #the failing part!
    print "getting raw list of unloaded" #doesn't get here!
    if object == Activities:
        id_list = [e.EVENTCODE for e in raw_list]
        id = e.EVENTCODE

this is part of the sqlalchemy class:

class Deposit(DeclarativeBase2):
    __tablename__ = 'NOAR_LOADDEPOSIT'
    #LINE = Column(INT(8)) 
    RECDEBNUM = Column(NVARCHAR(9) , primary_key=True)
    CURDATE = Column(BIGINT, nullable=False, default=text(u'((0))')) 
    PAYTYPE = Column(CHAR(1), nullable=False, default=text(u"('')")) 
    BANKCODE = Column(NVARCHAR(8), nullable=False, default=text(u"('')")) 
    CUSTACCNAME = Column(NVARCHAR(16), nullable=False, default=text(u"('')")) 
    PAGENUM = Column(NVARCHAR(5), nullable=False, default=text(u"('')"))
    RECNUM = Column(NVARCHAR(2), nullable=False, default=text(u"('')")) 
    RECDATE = Column(BIGINT, nullable=False, default=text(u'((0))')) 
    FIXNUM = Column(NCHAR(1), nullable=False, default=text(u"('')")) 
    EVENTNUM = Column(NVARCHAR(5), nullable=False, default=text(u"('')")) 
    GROUPCODE = Column(NVARCHAR(7), nullable=False, default=text(u"('')")) 
    IDNUMBER = Column(NVARCHAR(9), nullable=False, default=text(u"('')")) 

and the other class (both give the same problem)

class Activities(DeclarativeBase2):  


    __tablename__ = 'NOAR_LOADEVENTS'

    EVENTCODE = Column(NVARCHAR(8), primary_key=True)
    EVENTDES = Column(Unicode, nullable=False, default=text(u"('')"))
    TYPE = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LC = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LD = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LE = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LF = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LG = Column(NCHAR(1), nullable=False, default=text(u"('')"))
    LH = Column(NCHAR(1), nullable=False, default=text(u"('')"))

using: python 2.7 (64bit win) with pyodbc 2.1.11 vs mssql server 2005, sqlalchemy 0.7.3 tg2.1.3

be glad for any help or reference

alonisser
  • 11,542
  • 21
  • 85
  • 139

2 Answers2

1

Based on what you say, it is highly likely that text columns in the database table are encoded in cp1255 (Windows Hebrew) but your code is expecting UTF-8, perhaps in default of being explicitly told anything about the encoding.

You need to find what call into sqlalchemy or pyodbc needs to modified to correct the misinformation.

John Machin
  • 81,303
  • 11
  • 141
  • 189
0

found the answer - a classic wtf. seems the the production server had a slightly different configuration then the test server (although both were supposed to be the same) - so instead of nvarchar my application model was expecting it got varchar. thanks for the help. writing the question helped me clear my thoughts anyway

alonisser
  • 11,542
  • 21
  • 85
  • 139
  • Data type `NVARCHAR` means that the text is encoded in UCS-2 which is what MS appears to call "Unicode" and which from the Python side would need to be decoded with UTF-16LE ... nothing to do with UTF-8 as far as I can tell. – John Machin Dec 27 '11 at 11:43
  • As far as I can tell, NVARCHAR means UTF-16LE is needed for decoding, VARCHAR means cp1255 is needed for decoding, and UTF-8 will not work for either case. – John Machin Dec 27 '11 at 19:33
  • actually sqlalchemy is a good translator between utf-8 and cp1255 (hebrew). so this isn't a problem – alonisser Dec 27 '11 at 20:08
  • Huh? It's not doing a good job in your question; it's trying to decode a cp1255-encoded bytestring using the utf8 codec – John Machin Dec 27 '11 at 21:04
  • ... and I find it very hard to believe that an ORM would offer a facility to directly transcode between two encodings – John Machin Dec 27 '11 at 21:12
  • actually it does and as I wrote, changing the table mapping in the ORM from nvarchar to varchar did the trick and solved the problem..! thanks for the reference – alonisser Dec 28 '11 at 09:46