
I'm trying to convert a localization file that contains Chinese characters so that the Chinese characters end up in latin1 encoding.

However, when I run the python script I get this error...

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb9 in position 0: ordinal not in range(128)

Here's my python script. It essentially just takes the user's input to pick the file, then converts it (all lines that start with a [ or are empty should be skipped)... The part that needs to be converted is always at index 1 in a list.

# coding: utf8

# Enter File Name
file_name = raw_input('Enter File Path/Name To Convert: ')

# Open the file we write to...
write_file = open(file_name + "_temp", 'w+')

# Open the file we read from, and convert it line by line...
with open(file_name) as read_file:
    for line in read_file:
        # We ignore any line that starts with [ or is empty...
        if line.strip() and not line.startswith('['):
            split_string = line.split("=")
            if len(split_string) == 2:
                write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")
            else:
                write_file.write(line)
        else:
            write_file.write(line)



# Close the file we write to...
write_file.close()

An example config file is...

[Example]
Password=密碼

The output should be converted into...

[Example]
Password=±K½X

1 Answer


The Latin1 encoding cannot represent Chinese characters. The best you can get, if all you have for output is latin1, is escape sequences.

You are using Python 2.x. Python 3.x opens files as text, and automatically decodes the read bytes to (unicode) strings on reading.

In Python 2, when you read a file you get bytes - you are responsible for decoding those bytes to text (unicode objects in Python 2.x), processing them, and re-encoding them to the desired encoding when writing the information to another file.
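The decode-on-read, encode-on-write discipline described above is exactly what Python 3 enforces through its type system. A minimal Python 3 sketch of the same steps (the sample strings here are illustrative, not from the original file):

```python
# Python 3 separates bytes from text: what sits on disk is bytes, and it
# must be decoded before it can be handled as a string.
raw = "Password=密碼\n".encode("gbk")   # the bytes as they would sit on disk
assert isinstance(raw, bytes)

text = raw.decode("gbk")                # bytes -> str: decode on the way in
assert text == "Password=密碼\n"

utf8_bytes = text.encode("utf-8")       # str -> bytes: encode on the way out
assert utf8_bytes.decode("utf-8") == text
```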

So, the line that reads:

write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")

Should be:

write_file.write(split_string[0] + "=" + split_string[1].decode('gbk').encode('latin1', errors="backslashreplace") + "\n")

instead.

Now, note that I added the parameter errors="backslashreplace" to the encode call - what I said above remains true: latin1 is a character set of just 256 characters - it does contain the Latin letters and the most commonly used accented characters ("á é í ó ú ç ã ñ", etc.), some punctuation and math symbols, but no characters for other languages.
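To make the limit concrete, here is a small Python 3 sketch (the original script is Python 2, but the error-handler behaviour is the same): a strict latin1 encode of Chinese text fails, while errors="backslashreplace" falls back to escape sequences.

```python
# latin1 has no code points for Chinese characters, so a strict encode fails:
try:
    "密碼".encode("latin1")
except UnicodeEncodeError:
    print("strict latin1 encode fails")

# backslashreplace substitutes an escape sequence for each character it
# cannot represent (密 is U+5BC6, 碼 is U+78BC):
escaped = "密碼".encode("latin1", errors="backslashreplace")
print(escaped)  # → b'\\u5bc6\\u78bc'
```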

If you have to represent these as text, you should use the utf-8 encoding instead - and configure whatever software consumes the generated file to read that encoding.

That said, what you are doing is just horrible practice. Unless you are dealing with a really nightmarish file which is known to contain text in different encodings, you should just decode all the text to unicode and then re-encode it all - not just the part of the data which is meant to have non-ASCII characters. Unless you have other, gbk-incompatible, characters in the original file, your inner loop could just as well be:

with open(file_name) as read_file, open(file_name + "_temp", "wt") as write_file:
    for line in read_file:
        write_file.write(line.decode("gbk").encode("utf-8"))
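In Python 3 the same whole-file conversion needs no manual decode/encode at all: the encoding= parameter on open() does the transcoding. A sketch under that assumption (the file name and contents below are made up for the demo):

```python
import os
import tempfile

def convert(path):
    # Declare the source encoding on the reading handle and the target
    # encoding on the writing handle; Python 3 transcodes transparently.
    with open(path, encoding="gbk") as read_file, \
         open(path + "_temp", "w", encoding="utf-8") as write_file:
        for line in read_file:
            write_file.write(line)

# Demo: create a GBK-encoded file, convert it, and read the result as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "strings.ini")
with open(path, "w", encoding="gbk") as f:
    f.write("[Example]\nPassword=密碼\n")

convert(path)

with open(path + "_temp", encoding="utf-8") as f:
    assert f.read() == "[Example]\nPassword=密碼\n"
```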

As for your "example output" - that is just the _very same_ file, i.e. the same bytes as in the first file. The program displaying the line "Password=密碼" is "seeing" the file with the GBK encoding, and the other program is "seeing" the exact same bytes but interpreting them as latin1. You should not have to convert from one to the other.
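The "same bytes, two interpretations" point can be checked directly, because latin1 maps every byte value 0-255 to exactly one character. A Python 3 sketch, using the bytes behind the example output line:

```python
# latin1 maps every byte value to a character, so any byte sequence can be
# decoded as latin1 and re-encoded without loss.
raw = b"\xb1K\xbdX"                       # the bytes rendered as "±K½X"
assert raw.decode("latin1") == "±K½X"     # one character per byte
assert raw.decode("latin1").encode("latin1") == raw   # lossless round trip

# The property holds for arbitrary bytes:
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1").encode("latin1") == all_bytes
```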

jsbueno
  • I cannot change how the program that loads these localization files reads them, but it does indeed read them as latin1. That being said... is there any way to just change the file's encoding to latin1 so it is read as latin1 in all instances? Your change now produces this error: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 2-3: illegal multibyte sequence – Ricky Jan 21 '16 at 17:55
  • No, there is no way to inherently know the encoding of a text file. The program reading it has to be "told" it should be read as latin1. Latin1 is a particularly interesting codec in which all byte values are valid, and thus one can read a file as latin1 and write it back, even if it has another encoding, without destroying information or getting errors. – jsbueno Jan 21 '16 at 19:30
  • Interesting... Do you know how I might resolve the new error? – Ricky Jan 21 '16 at 19:39
  • If you are getting a GBK decoding error, then you are in that worst-case scenario where this file indeed has inconsistent encoding. Revert to the way you were doing it, where you just decode the parts that are in Chinese. That will free you from the errors, but won't get you what you need. Read the unicode article I referred to above so you get more ideas; you are on the wrong track. – jsbueno Jan 22 '16 at 12:37