I'm parsing a file that is mostly ASCII but includes non-ASCII characters encoded in Big5 (Traditional Chinese).
Specifically, it is a CWR file from CISAC.
I'm trying to decode the non-ASCII characters, without success. Here is an example line:
NWN000003930000016400507347 ^N&ÊÅ+/{^O
The field from position 29 to 188 should be encoded in Big5. This is my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
import binascii
from chardet.universaldetector import UniversalDetector
from chardet import detect

with open("/path/to/file") as fd:
    line = fd.readline()
    while line:
        if line[0:3] == 'NWN':
            last_name = line[29:188]
            print last_name
            print detect(line)['encoding']
            print last_name.decode('big5')
        line = fd.readline()
However, the result I get for the row above is:
None
&岒+/{
And for the following row:
NWN000000140000016300401453 ^N/õ<Dï.^O
the script even crashes:
windows-1252
Traceback (most recent call last):
  File "test_big5.py", line 36, in <module>
    print last_name.decode('big5')
UnicodeDecodeError: 'big5' codec can't decode bytes in position 1-2: illegal multibyte sequence
I also tried the following:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from codecs import EncodedFile
from_encoding = 'big5'
to_encoding = 'utf8'
sys.stdout = EncodedFile(sys.stdout, from_encoding, to_encoding)
f = file("/path/to/file", "r")
str = f.read()
sys.stdout.write(str)
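To make the problem easier to reproduce, here is a minimal sketch of the byte-level check I am attempting. It slices the same 29 to 188 range from a line read in binary mode, dumps the raw bytes as hex, and decodes them as Big5 with undecodable bytes replaced rather than raising. The helper name `inspect_field`, the synthetic sample record, and the choice of the `replace` error handler are my own assumptions for illustration; they are not taken from the CWR specification.

```python
import binascii

# Slice bounds copied from my script above (assumed to be byte offsets).
FIELD_START, FIELD_END = 29, 188


def inspect_field(raw_line, encoding="big5"):
    """Return (hex dump, decoded text) for the name field of one raw line.

    raw_line must be bytes, i.e. read from a file opened in binary mode.
    Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    """
    # Trailing padding and line endings are stripped before decoding.
    field = raw_line[FIELD_START:FIELD_END].rstrip(b" \r\n")
    hex_dump = binascii.hexlify(field).decode("ascii")
    text = field.decode(encoding, "replace")
    return hex_dump, text


if __name__ == "__main__":
    # Synthetic record (not from my file): 29 ASCII bytes, then a known
    # string encoded as Big5, then space padding.
    name = u"\u4e2d\u6587"
    sample = b"NWN" + b"0" * 26 + name.encode("big5") + b" " * 10 + b"\n"
    print(inspect_field(sample))
```

On a record that I build myself this round-trips as expected, so comparing its hex dump with the hex dump of a real NWN line should show whether the bytes in my file are really Big5.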
I have attached a sample file here.
Any idea what I'm doing wrong?