How to detect undecoded characters in Python?

Question

I'm getting data from a CSV file, doing something with it, and then writing it to a text template.

The problem occurs when I come across characters that I cannot encode.

For example, when I come across a value written in Chinese, the field appears blank when I open the file in a CSV editor (e.g. LibreOffice Calc on Linux).

But when I get the data via csv.reader in my script, I can see that it is actually a string that hasn't been decoded properly. And when I try to write it to a template, I get this weird SUB string.

Here is the breakdown of the problem:

import csv
from string import Template

for row in csv.DictReader(csvfile):
    # take values from the row and store them in a dictionary (rec)
    ....
    # take the values from the dictionary and write them to a template
    with open('template.txt', 'r+') as template:
        src = Template(template.read())
        content = src.substitute(rec)

    with open('myoutput.txt', 'w') as bill:
        bill.write(content)

And the template.txt looks like this:

$name
$address
$city
...

All of this generates txt files like this:

Bill
North Grove 14
Scottsdale
...

If any of the dictionary values is empty (e.g. an empty string ''), my template rendering function ignores the tag. For example, if the address attribute were missing from a particular row, the output would be:

Bill
Scottsdale
...

When I try to do that with my Chinese data, my function does write the data, because the strings in question are not empty. And when I write them to a template, the end result looks like this:

    SUB
    SUB
    Hong Kong
    ...

How can I display my data properly? Also, is there a way to skip that data, for example by trying to decode it and converting it to an empty string if decoding fails?

P.S. try/except won't work here, because mystring.encode('utf-8') or mystring.encode('latin-1') will encode the string without raising, but the output will still be garbage.

EDIT

After printing out the problem row, the output of the problematic values is the following:

{'Name': '\x1a \x1a\x1a', 'State': '\x1a\x1a\x1a'}

Are you using Python 2 or 3? What is the encoding of your csv file? How do you know that your file contains valid data (if both LibreOffice and Python are not working, I guess you are using a third tool)? — Andrea Corbellini, Sep 03 '15 at 17:31
@Andrea Corbellini I don't know the encoding of the csv file, it's not mine. I also don't know if the file contains valid data. — Saša Kalaba, Sep 03 '15 at 17:42
Try asking file: `file name-of-csv-file`. It should return something like `a.csv: UTF-8 Unicode text`. — Andrea Corbellini, Sep 03 '15 at 17:43
This may be relevant http://stackoverflow.com/questions/889941/which-encoding-uses-the-x-backslash-x-prefix — jtrayford, Sep 03 '15 at 18:19

score 2 · Accepted Answer · edited May 23 '17 at 11:44

\x1a is the ASCII substitute character. This is the reason why you see "SUB" in your output. This character is generally used as a replacement by programs that try to decode bytes but fail.

Your CSV file does not contain valid data. Probably it was generated starting from a source containing valid data, but the file itself does not contain valid data anymore.

(Just guessing: did you perhaps open the file with LibreOffice and then save it?)
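
To make the point about \x1a concrete, here is a minimal sketch that reports which fields of a row contain the substitute character. The helper name damaged_fields and the sample row are illustrative assumptions modelled on the row printed in the question, not part of the original code.

```python
SUB = "\x1a"  # ASCII substitute character (code point 26)


def damaged_fields(row):
    """Return the keys whose values contain at least one SUB character."""
    return [key for key, value in row.items() if SUB in value]


# Illustrative row, modelled on the one printed in the question.
row = {"Name": "\x1a \x1a\x1a", "State": "\x1a\x1a\x1a", "City": "Hong Kong"}
print(damaged_fields(row))  # ['Name', 'State']
```

A check like this only tells you that the original text was already lost before the file reached your script. It cannot recover the Chinese characters.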

If you want to check whether your string contains ASCII unprintable characters, use this:

import string

def is_printable(data):
    return all(c in string.printable for c in data)

If you want to remove ASCII unprintable characters:

import string

def strip_unprintable(data):
    return ''.join(c for c in data if c in string.printable)

If you want to deal with Unicode strings, then replace the `c in string.printable` test with:

ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)

(Credit goes to "What is the range of Unicode Printable Characters?")
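
Put together, the Unicode-aware variant might look like the sketch below. The function names are hypothetical, chosen here for illustration. Note that the range test also rejects tab and newline, since they fall at or below 0x1f, whereas string.printable accepts them.

```python
def _is_printable_char(c):
    """False for C0 controls (<= 0x1f), DEL (0x7f) and C1 controls (0x80-0x9f)."""
    code = ord(c)
    return code > 0x1f and code != 0x7f and not (0x80 <= code <= 0x9f)


def is_printable_unicode(data):
    """True if every character in data passes the range test above."""
    return all(_is_printable_char(c) for c in data)


def strip_unprintable_unicode(data):
    """Return data with all characters that fail the range test removed."""
    return "".join(c for c in data if _is_printable_char(c))
```

With this version, properly decoded non-ASCII text such as Chinese characters passes the check, while a value made of \x1a characters does not.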

score 0 · Answer 2 · answered Sep 04 '15 at 11:42

Thanks to @Andrea Corbellini, your answer helped me find a solution.

import string

def stringcheck(line):
    # return 0 as soon as a character outside string.printable is found
    for letter in line:
        if letter not in string.printable:
            return 0
    return 1

However, I don't think this is the most pythonic way of doing this, so any suggestions on how to make this better would be much appreciated.
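
One possible tidier variant, offered as a sketch rather than a definitive answer: return a real boolean, precompute the set of allowed characters, and blank out failing values before substitution so that the template receives an empty string. The names PRINTABLE and clean, and the sample record, are assumptions made for illustration.

```python
import string
from string import Template

# Precomputed set of allowed characters for fast membership tests.
PRINTABLE = frozenset(string.printable)


def stringcheck(line):
    """Return True if every character of line is in string.printable."""
    return PRINTABLE.issuperset(line)


# Hypothetical record; the first value mimics the damaged data from the question.
rec = {"name": "\x1a \x1a\x1a", "address": "North Grove 14", "city": "Scottsdale"}

# Replace values that fail the check with an empty string.
clean = {key: (value if stringcheck(value) else "") for key, value in rec.items()}

content = Template("$name\n$address\n$city").substitute(clean)
```

Keep in mind that string.printable is ASCII-only, so this check also rejects valid non-ASCII text such as accented names.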
