The problem isn't that "`csv` won't handle UTF directly"; nothing in Python handles UTF directly, and you wouldn't want it to. When you want Unicode, you use Unicode; when you want a particular encoding (whether UTF-8, UTF-16, or otherwise), you have to use byte strings and keep track of the encoding manually.
Python 2.x `csv` can't handle Unicode, so the easy way is ruled out. In fact, it only understands byte strings, and always treats them as ASCII. However, it doesn't tamper with anything other than the specific characters it cares about (delimiter, quote, newline, etc.). So, as long as you use a charset whose `,`, `"`, and `\n` (or whatever special characters you've selected) are guaranteed to be encoded to the same bytes as in ASCII, and nothing else will ever be encoded to those bytes, you're fine.
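As a quick sketch (in modern Python syntax for illustration), UTF-8's ASCII compatibility can be checked directly:

```python
# UTF-8 is ASCII-compatible: ',', '"', and '\n' each encode to the same
# single byte as in ASCII, and no other character's encoding can ever
# contain those byte values -- every byte of a multi-byte UTF-8 sequence
# has its high bit set.
special = u',"\n'
assert special.encode('utf-8') == b',"\n'
assert all(b >= 0x80 for b in bytearray(u'\u00e9\u4e2d'.encode('utf-8')))
print('UTF-8 is csv-safe')
```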
Of course you don't just want to create a CSV file in any arbitrary charset; you presumably want to consume it in some other program—Excel, a script running on a server somewhere, whatever—and you need to create a CSV file in the charset that other program expects. But if you have control over the other program (e.g., it's Excel, and you know how to select the charset in its Import command), UTF-8 is almost always the best choice.
At any rate, UTF-16 does not qualify as a CSV-friendly charset, because, e.g., `,` is two bytes, not one.
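A one-liner shows the problem (again in modern Python syntax for illustration):

```python
# In UTF-16 the comma is two bytes, one of them NUL, so a byte-oriented
# parser scanning for the single ASCII byte 0x2C will misbehave.
comma = u','.encode('utf-16-le')
assert comma == b',\x00'
assert len(comma) == 2
```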
So, how do you deal with this? The Examples in the documentation have the answer. If you just copy the `unicode_csv_reader` function and use it together with `codecs.open`, you're done. Or copy the `UnicodeReader` class and pass it an `encoding`.
But if you read the code for the examples, you can see how trivial it is: decode your UTF-16, re-encode to UTF-8, and pass that to the `reader` or `DictReader`. And you can reduce that to one extra line of code, `(line.encode('utf-8') for line in infile)`. So:
    import codecs
    import csv

    with codecs.open(infileName, 'rU', 'utf-16') as infile:
        utf8 = (line.encode('utf-8') for line in infile)
        rdr = csv.DictReader(utf8, delimiter='\t')
        vnames = rdr.fieldnames
        for row in rdr:
            do_something(row)
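The decode/re-encode step can be exercised on its own; here's a sketch using an in-memory stream (`io.StringIO` standing in for the already-decoded file that `codecs.open` would give you):

```python
import io

# Simulate the decoded (unicode) lines that codecs.open would yield.
infile = io.StringIO(u'name\tcity\nAlice\tParis\n')

# The one extra line: lazily re-encode each unicode line as UTF-8 bytes.
utf8 = (line.encode('utf-8') for line in infile)

print(list(utf8))  # [b'name\tcity\n', b'Alice\tParis\n']
```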
Finally, why is your existing code raising that exception? It's not in the UTF-16 decoding. It's because you're passing the resulting `unicode` strings to code that wants a byte `str`. In Python 2.x, that pretty much always means the string gets automatically encoded with the default encoding, which defaults to ASCII, and that's what raises the error. And that's why you have to explicitly encode to UTF-8.
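The error itself is easy to reproduce: it's the ASCII codec, not UTF-16, that complains. Here the implicit encode Python 2 would perform is written out explicitly:

```python
# Python 2 would do this encode implicitly whenever a unicode string was
# handed to bytes-oriented code; ASCII can't represent the e-acute.
try:
    u'caf\xe9'.encode('ascii')
except UnicodeEncodeError as e:
    print(e)
```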