I'm getting data from a CSV file, processing it, and then writing it to a text template.

The problem occurs when I come across characters that I cannot encode.

For example, when I come across a value written in Chinese, the field appears blank when I open the file in a CSV editor (e.g. LibreOffice Calc on Linux).

But when I get the data via csv.reader in my script, I can see that it is actually a string that hasn't been decoded properly. And when I try to write it to a template, I get this weird SUB string.

Here is the breakdown of the problem:

import csv
from string import Template

for row in csv.DictReader(csvfile):
    # take values from the row and store them in a dictionary (rec)
    ....
    # take the values from the dictionary and write them to a template
    with open('template.txt', 'r') as template:
        src = Template(template.read())
        content = src.substitute(rec)

    with open('myoutput.txt', 'w') as bill:
        bill.write(content)

And the template.txt looks like this:

$name
$address
$city
...

All of this generates txt files like this:

Bill
North Grove 14
Scottsdale
...

If any of the dictionary values is empty (e.g. an empty string ''), my template rendering function ignores the tag. For example, if the address attribute was missing from a particular row, the output would be:

Bill
Scottsdale
...

When I try to do that with my Chinese data, my function does write the data, because the strings in question are not empty. When I write them to a template, the end result looks like this:

    SUB
    SUB
    Hong Kong
    ...

How can I display my data properly? Also, is there a way to skip that data, for example something that tries to decode the data and, if that fails, converts it to an empty string?

P.S. try/except won't work here, because mystring.encode('utf-8') or mystring.encode('latin-1') will encode the string successfully, but it will still be output as garbage.

EDIT

After printing out the problem row, the output of the problematic values is the following:

{'Name': '\x1a \x1a\x1a', 'State': '\x1a\x1a\x1a'}
Saša Kalaba

2 Answers

\x1a is the ASCII substitute character. This is the reason why you see "SUB" in your output. This character is generally used as a replacement by programs that try to decode bytes but fail.

Your CSV file does not contain valid data. It was probably generated from a source that did contain valid data, but the original characters were replaced somewhere along the way and are no longer in the file itself.

Just a guess: did you perhaps open the file with LibreOffice and then save it?
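
One way to confirm this is to look at the raw bytes of the file instead of the decoded text. This is only a minimal sketch; the helper name `count_sub_bytes` and the path `data.csv` are made up for illustration:

```python
SUB = b'\x1a'  # the ASCII substitute character as a single byte

def count_sub_bytes(raw):
    """Count how many SUB (0x1A) bytes a bytes object contains."""
    return raw.count(SUB)

# Usage ('data.csv' is a placeholder path, not taken from the question):
# with open('data.csv', 'rb') as f:
#     print(count_sub_bytes(f.read()))
```

If the count is non-zero, the substitute characters are stored in the file itself, so no choice of encoding when reading it will bring the original text back.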


If you want to check whether your string contains only printable ASCII characters, use this:

import string

def is_printable(data):
    return all(c in string.printable for c in data)

If you want to remove everything that is not a printable ASCII character:

def strip_unprintable(data):
    return ''.join(c for c in data if c in string.printable)

If you want to deal with Unicode strings, then replace c in string.printable with:

ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)

(Credit goes to "What is the range of Unicode Printable Characters?")
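
Put together, the Unicode variant could look roughly like the sketch below. It assumes Python 3 `str` values, and the helper `is_printable_char` is a name introduced here for illustration:

```python
def is_printable_char(c):
    """True unless c is a C0 control, DEL, or a C1 control character."""
    code = ord(c)
    return code > 0x1f and code != 0x7f and not (0x80 <= code <= 0x9f)

def is_printable(data):
    # An empty string counts as printable, because all() of nothing is True.
    return all(is_printable_char(c) for c in data)

def strip_unprintable(data):
    return ''.join(c for c in data if is_printable_char(c))
```

Note that, unlike string.printable, this condition also treats tabs and newlines as unprintable, since they fall below 0x20.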

Andrea Corbellini

Thanks to @Andrea Corbellini, your answer helped me find a solution.

import string

def stringcheck(line):
    for letter in line:
        if letter not in string.printable:
            return 0
    return 1

However, I don't think this is the most Pythonic way of doing this, so any suggestions on how to improve it would be much appreciated.
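
One possible tidier form, offered only as a sketch: return a bool from `all()` and blank out the failing values of a row in a single step. `blank_unprintable` and `PRINTABLE` are hypothetical names, and the row is assumed to be a dict of strings, like the ones csv.DictReader yields:

```python
import string

PRINTABLE = set(string.printable)  # set lookup instead of scanning a string

def stringcheck(line):
    """Return True if every character of line is printable ASCII."""
    return all(letter in PRINTABLE for letter in line)

def blank_unprintable(row):
    """Return a copy of row with every failing value replaced by ''."""
    return {key: value if stringcheck(value) else ''
            for key, value in row.items()}
```

With the problem row from the question, `blank_unprintable({'Name': '\x1a \x1a\x1a', 'State': '\x1a\x1a\x1a'})` gives `{'Name': '', 'State': ''}`. Keep in mind that this ASCII-only check also rejects legitimate non-ASCII text such as 'Saša'.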

Saša Kalaba