I am aware that `csv` won't handle UTF directly, and that part of the solution is to open the file using `codecs`, which opens the stream with the right encoding. I still get the error, however:

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 121: ordinal not in range(128)

Is there a way to process the byte stream from infile, coercing it to ascii before it is handed over to csv.DictReader? Thanks.

    import codecs
    import csv

    with codecs.open(infileName, 'rU', 'utf-16') as infile:
        rdr = csv.DictReader(infile, delimiter='\t')
        vnames = rdr.fieldnames
        for row in rdr:
            do_something(row)
    The `csv` module still handles *bytestrings* only; it is trying to encode the unicode values *back to byte strings*. In other words, your approach is never going to work. – Martijn Pieters Oct 01 '13 at 17:59
  • Open the file as a bytestream instead, using regular `open()`. – Martijn Pieters Oct 01 '13 at 17:59
  • If by "coerce" you mean "remove any characters which are not 7-bit", that is not hard to do. Hint: `ord()` – tripleee Oct 01 '13 at 18:02
  • @MartijnPieters: Opening it with regular `open` isn't going to work. `csv` handles *ASCII-compatible* byte strings only; UTF-16 doesn't qualify. – abarnert Oct 01 '13 at 18:16
  • @abarnert: Yes, I realised that; contemplating VTCing this as a dupe of [Python UTF-16 CSV reader](http://stackoverflow.com/q/9177820). – Martijn Pieters Oct 01 '13 at 18:17
  • As a side note, for Python 2.7, you may want to consider `io.open` instead of `codecs.open`. There are some cases for which their behavior differs, and in those cases you have to decide which behavior you want. But in most cases, they do effectively the same thing, and `io` has been getting improvements and bug fixes for the last 6 years (because it's the main interface to files in Python 3.x, and some things have been ported back), while `codecs` has not. – abarnert Oct 01 '13 at 18:17
  • @MartijnPieters: It does seem pretty similar. And really, Mark Tolonen's answer there is a nice, simple solution to this poster's problem (although it's not the accepted answer, and it would be nice to have a little more than one line and a link). – abarnert Oct 01 '13 at 18:18
  • @abarnert: I was toying with [`codecs.StreamRecoder()`](http://docs.python.org/2/library/codecs.html#codecs.StreamRecoder) as well, but the API is a little.. verbose. – Martijn Pieters Oct 01 '13 at 18:20
  • @MartijnPieters: There's an example somewhere in the Python 3 docs of using the `io` module to do the rough equivalent of `StreamRecoder` but simpler. But really, the `UTF8Recoder` in the `csv` examples is all you need for this case, because `csv` doesn't care about the whole file-like interface, just `next`. – abarnert Oct 01 '13 at 18:24
  • @MartijnPieters: Or, really, just `(line.encode('utf-8') for line in infile)`… I suspect that example was written back in 2.3 and not updated for generator expressions in 2.4. – abarnert Oct 01 '13 at 18:27

1 Answer


The problem isn't that "csv won't handle UTF directly"; nothing in Python handles UTF directly, and you wouldn't want it to. When you want Unicode, you use Unicode; when you want a particular encoding (whether UTF-8, UTF-16, or otherwise), you have to use byte strings and keep track of the encoding manually.
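
For instance, in Python 2.x those are two distinct types; a minimal illustration:

    u = u'caf\xe9'           # unicode: a sequence of code points
    b = u.encode('utf-8')    # str: bytes in one particular encoding
    print type(u), type(b)   # <type 'unicode'> <type 'str'>
    print repr(b)            # 'caf\xc3\xa9' -- the \xe9 became two bytes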


Python 2.x `csv` can't handle Unicode, so the easy way is ruled out. In fact, it only understands byte strings, and always treats them as ASCII. However, it doesn't tamper with anything other than the specific characters it cares about (delimiter, quote, newline, etc.). So, as long as you use a charset whose `,`, `"`, and `\n` (or whatever special characters you've selected) are guaranteed to be encoded to the same bytes as in ASCII, and nothing else will ever be encoded to those bytes, you're fine.
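
UTF-8 is an example of a charset that qualifies: it encodes every ASCII character to the same single byte as ASCII does, and those bytes never appear inside the encoding of any other character. A quick sketch (the literals here are just stand-in data):

    import csv

    # one tab-delimited line, encoded to UTF-8 byte strings
    lines = [u'spam\tcaf\xe9\n'.encode('utf-8')]
    for row in csv.reader(lines, delimiter='\t'):
        # csv splits on the tab byte; the other UTF-8 bytes pass through untouched
        print [cell.decode('utf-8') for cell in row]   # [u'spam', u'caf\xe9']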

Of course you don't just want to create a CSV file in any arbitrary charset; you presumably want to consume it in some other program—Excel, a script running on a server somewhere, whatever—and you need to create a CSV file in the charset that other program expects. But if you have control over the other program (e.g., it's Excel, and you know how to select the charset in its Import command), UTF-8 is almost always the best choice.

At any rate, UTF-16 does not qualify as a CSV-friendly charset, because, e.g., `,` is two bytes, not one.
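
You can see that in the interpreter:

    >>> u','.encode('utf-16-le')
    ',\x00'
    >>> u','.encode('utf-8')    # ASCII-compatible: still one byte
    ','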


So, how do you deal with this? The Examples section in the `csv` documentation has the answer. If you just copy the `unicode_csv_reader` function and use it together with `codecs.open`, you're done. Or copy the `UnicodeReader` class and pass it an encoding.
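
For reference, the docs' recipe boils down to something like this (lightly paraphrased from the Examples section of the Python 2 `csv` documentation):

    import csv

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.reader only does byte strings, so feed it UTF-8...
        reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
        for row in reader:
            # ...then decode each cell back to unicode
            yield [unicode(cell, 'utf-8') for cell in row]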

But if you read the code for the examples, you can see how trivial it is: decode your UTF-16, re-encode to UTF-8, and pass that to the `reader` or `DictReader`. And you can reduce that to one extra line of code: `(line.encode('utf-8') for line in infile)`. So:

    with codecs.open(infileName, 'rU', 'utf-16') as infile:
        utf8 = (line.encode('utf-8') for line in infile)
        rdr = csv.DictReader(utf8, delimiter='\t')
        vnames = rdr.fieldnames
        for row in rdr:
            do_something(row)
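
Note that the values `DictReader` hands back are now UTF-8 encoded byte strings, not unicode. If `do_something` expects unicode, decode on the way out; for example (assuming every field is present in every row):

    for row in rdr:
        row = {k.decode('utf-8'): v.decode('utf-8') for k, v in row.items()}
        do_something(row)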

Finally, why is your existing code raising that exception? It's not in the UTF-16 decoding. It's because you're passing the resulting unicode strings to code that wants byte strings (`str`). In Python 2.x, that pretty much always means Python automatically encodes them with the default encoding, which defaults to ASCII, and that is what raises the error. And that's why you have to explicitly encode to UTF-8.
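
You can reproduce that failure in isolation:

    >>> u'\xed'.encode('ascii')   # what the implicit conversion is doing
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 0: ordinal not in range(128)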

abarnert
  • Thanks abarnert, Martijn. That's a lot to chew on; my thinking about the encoding process was off. I'll step through the process. – user2808321 Oct 01 '13 at 19:13
  • @user2808321: You pretty much always have to think through the encoding process, especially when your input and output charset is something non-ASCII-friendly like UTF-16. Unless, of course, you upgrade to Python 3.x. – abarnert Oct 01 '13 at 19:32