I'm parsing a file that is in ASCII format but includes non-ASCII characters encoded in Big5 (Traditional Chinese).
Specifically, it is a CWR file from CISAC.

I'm trying to decode the non-ASCII characters, without success. Here is an example line:

NWN000003930000016400507347 ^N&ÊÅ+/{^O

From position 29 to 188 should be encoded in big5.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import sys
import binascii
from chardet.universaldetector import UniversalDetector
from chardet import detect

with open("/path/to/file") as fd:
    line = fd.readline()
    while line:
        if line[0:3] == 'NWN':
            last_name = line[29:188]
            print last_name
            print detect(line)['encoding']
            print last_name.decode('big5')
        line = fd.readline()

However, the result I get for the row above is:

None
&岒+/{

And for the following row:

NWN000000140000016300401453 ^N/õ<Dï.^O

it even crashes:

windows-1252
Traceback (most recent call last):
  File "test_big5.py", line 36, in <module>
    print last_name.decode('big5')
UnicodeDecodeError: 'big5' codec can't decode bytes in position 1-2: illegal multibyte sequence

I also tried as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from codecs import EncodedFile

from_encoding = 'big5'
to_encoding = 'utf8'    
sys.stdout = EncodedFile(sys.stdout, from_encoding, to_encoding)

f = file("/path/to/file", "r")
str = f.read()
sys.stdout.write(str)

I attach a sample file here

Any idea about what I'm doing wrong?

xtarafa
  • Why are you not opening the file as Big5 in the first place? – Ignacio Vazquez-Abrams Mar 22 '17 at 18:06
  • Can you post a small example file that we can try? Encoded files can be difficult to post, but if you have a file that fails in the first few lines, you could post the result of `print(open('/path/to/file','rb').readlines()[:3])` and then we can easily take that list and rebuild the file ourselves. – tdelaney Mar 22 '17 at 18:16
  • I suspect you can solve the problem by opening the file in binary (`"rb"`). Your compare would have to be `if line[0:3] == b'NWN':`. – tdelaney Mar 22 '17 at 18:17
  • @IgnacioVazquez-Abrams comment may also work. Python's 'big5' codec accepts pure ascii characters along with big5 mbcs characters. – tdelaney Mar 22 '17 at 18:23
  • @tdelaney I tried opening the file in binary mode as you suggested, but I get the same problem. I attached a sample file with the lines that cause the problem. – xtarafa Mar 27 '17 at 09:22
  • @IgnacioVazquez-Abrams I tried that too. I've updated the question with an attempt at opening the file as Big5. I also attached a sample file. – xtarafa Mar 27 '17 at 09:34

1 Answer

You should be able to read the file with the big5 codec. When I tried it, I got:

>>> import codecs
>>> codecs.open('nwn.file', encoding="big5").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 668, in read
    return self.reader.read(size)
UnicodeDecodeError: 'big5' codec can't decode bytes in position 1790-1791: illegal multibyte sequence

The lines in your file are pretty long, so I read them into a list (no codecs, just open the file in "rb" mode and readlines()) and trimmed out whitespace. Now I can use this list as a runnable example. This is what I was getting at when I suggested you post data from the file read in binary mode.

test = [
b'NWN000003930000016400507347 \x0e&\xca\xc5+/{\x0f                    ZH\r\n',
b'NWN000003960000016400507347 \x0e&\xca\xc5+/{\x0f                    ZH\r\n',
b'NWN000005660000046800507347 \x0e&\xca\xc5+/{\x0f                    ZH\r\n',
b'NWN000016200000016400507347 \x0e&\xca\xc5+/{\x0f                    ZH\r\n',
b'NWN000025600000016400507347 \x0e&\xca\xc5+/{\x0f                    ZH\r\n',
b'NWN000000140000016300401453 \x0e/\xf5<D\xef.\x0f                    ZH\r\n',
]
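A list like this can be built with a few lines of Python. The following is only a sketch: `read_raw_lines` and `squeeze_spaces` are hypothetical helper names, and the rule of shortening any run of more than 20 spaces is an assumption about how the padding might be trimmed, not something the CWR format prescribes.

```python
import re

# Assumption: any run of more than 20 spaces is padding that can be shortened.
LONG_SPACE_RUN = re.compile(b' {21,}')


def read_raw_lines(path):
    """Return the lines of the file as byte strings, without decoding them."""
    with open(path, 'rb') as fd:
        return fd.readlines()


def squeeze_spaces(raw):
    """Shorten every run of more than 20 spaces to exactly 20 spaces."""
    return LONG_SPACE_RUN.sub(b' ' * 20, raw)
```

Printing `[squeeze_spaces(line) for line in read_raw_lines(path)]` gives a list literal that can be pasted into a post, which avoids the encoding problems of pasting the raw text.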

Then I decoded the list line by line. Instead of the default errors='strict', I used errors='replace' to see what is going on. Those &岒+/{ strings look a bit odd, but then I don't know what this file is. Notice the replacement characters (�) in the final line: it contains sequences that are not valid Big5. This file is corrupt.

>>> for line in test:
...     print line.strip().decode('big5', errors='replace')
... 
NWN000003930000016400507347 &岒+/{                    ZH
NWN000003960000016400507347 &岒+/{                    ZH
NWN000005660000046800507347 &岒+/{                    ZH
NWN000016200000016400507347 &岒+/{                    ZH
NWN000025600000016400507347 &岒+/{                    ZH
NWN000000140000016300401453 /�D�                    ZH

If you want most of the data, you could decode line by line like my example and catch that error.
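A minimal sketch of that approach, assuming each line is handled as a byte string. `decode_lines` is a hypothetical helper name, and falling back to errors='replace' for a bad line is just one option (you could also skip or log it). The two sample lines are taken from the list above with the padding shortened.

```python
def decode_lines(raw_lines, encoding='big5'):
    """Decode each byte string separately.

    Returns a list of (text, ok) pairs. ok is False when the line was not
    valid in the given encoding and was decoded with errors='replace'.
    """
    decoded = []
    for raw in raw_lines:
        try:
            decoded.append((raw.decode(encoding), True))
        except UnicodeDecodeError:
            # Keep what can be salvaged; undecodable bytes become U+FFFD.
            decoded.append((raw.decode(encoding, 'replace'), False))
    return decoded


sample = [
    b'NWN000003930000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000000140000016300401453 \x0e/\xf5<D\xef.\x0f ZH\r\n',
]

results = decode_lines(sample)
for number, (text, ok) in enumerate(results, 1):
    if not ok:
        print('line %d is not valid big5' % number)
```

The exact text produced for a bad line can differ between Python versions, because the multibyte codecs do not all resynchronize the same way after an invalid sequence, so it is safer to rely on the `ok` flag than on the replaced text.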

tdelaney