
I am reading a large amount of data from an Excel spreadsheet, reformatting it, and writing it back out, using the following general structure:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for z in range(1, sheettwo.nrows):  # start at 1 to skip the header row (z = i + 1 overran the last row)
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
out.close()

where x and y are arbitrary column indices in this case, with x being less arbitrary: its cells contain non-ASCII (UTF-8) characters

So far I have only been using .encode('utf-8') on cells where I know (or foresee) that an error would occur otherwise.

My question is basically this: is there a disadvantage to using .encode('utf-8') on every cell, even when it is unnecessary? Efficiency is not an issue. The main point is that it works even if there is a non-ASCII character in a place there shouldn't be. If no errors occur when I just tack .encode('utf-8') onto every cell read, I will probably end up doing that.

Logan

2 Answers


The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are likely reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert them to ASCII (which is what the str() function does). Use the code below:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded in UTF-8
out = open('output.file', 'w')
for z in range(1, sheettwo.nrows):  # start at 1 to skip the header row
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
Michael
  • That is a great answer! thank you! however, I just realized a second part to my question. I am then using the output of this to upload into SQL Tables. Would SQL support the output of your modified code? – Logan Oct 13 '11 at 03:45
  • what does str(sheettwo.cell(z,x).value.encode('utf-8')) do to utf-8 strings with unicode only characters in it? – Logan Oct 13 '11 at 06:09
  • @MichaelKlocker: I'm afraid that this solution does not work on Windows: in fact, `codecs.open()` opens files in *binary* mode, so that `\n` is not converted to the Windows newline codes. The simplest solution to this problem seems to not use `codecs` and instead manually encode text upon writing (http://stackoverflow.com/questions/5941988/print-to-utf-8-encoded-file-with-platform-dependent-newlines). – Eric O. Lebigot Oct 13 '11 at 07:53
  • @EOL - Thanks for the hint. I was not aware of that. @ logan - It seems that you really just want to get the data out of this Excel spreadsheet and into a DB. Is it even necessary to write Python code for it? Saving the file as a CSV file might do the trick. – Michael Oct 14 '11 at 01:15
  • Hi Logan, The command: str(sheettwo.cell(z,x).value.encode('utf-8')) ... will fail if this cell contains Unicode characters. The reason is simple. Unicode is really just showing you code-points, how you are writing these to the disk is dependent on the encoding. Now if you try to take a Unicode code point that is above ASCII character 127 and try to force this code point into ASCII (by using the str() method), Python will raise an exception to prevent you from losing data. You seem to be confused about Unicode vs. UTF-8. For more info read: http://www.joelonsoftware.com/articles/Unicode.html – Michael Oct 14 '11 at 01:25
  • @MichaelKlocker I don't know how, but it works using .encode('utf-8'). I won't pretend to guess at how it works, because my limited understanding would agree with you, but I put a non-ASCII character in a cell, used that on it, and it wrote it to the file correctly. It does throw an error if I only have numbers in the cell, but that's a different, easy-to-take-care-of issue. I do get the exception when I don't use .encode, but with .encode it passes flawlessly. And isn't everyone confused about character sets? lol Also, exporting it as a CSV won't work. We tried that first. – Logan Oct 22 '11 at 22:22
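The str()-vs-.encode('utf-8') point Michael makes in the comments above can be demonstrated standalone, without a spreadsheet. This is a minimal sketch (the u"" literal and hypothetical cell value are just for illustration; it runs unchanged on Python 2 and on Python 3.3+):

```python
# -*- coding: utf-8 -*-
# A hypothetical cell value containing a non-ASCII character.
text = u"caf\u00e9"

# An explicit UTF-8 encode succeeds for any Unicode text:
encoded = text.encode('utf-8')
assert encoded == b'caf\xc3\xa9'

# Forcing the same text through ASCII (what Python 2's str() does
# implicitly) raises UnicodeEncodeError for code points above 127:
try:
    text.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```

This is why .encode('utf-8') "passes flawlessly" where the bare str() call fails: the encode picks an encoding that can represent every code point, while the implicit conversion is restricted to ASCII.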

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
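A side note on why repr() is the right tool here: for floats it emits enough digits that the printed value round-trips exactly back to the same float, which the 12-digit str()/unicode() of Python 2 did not guarantee. A minimal check (runs unchanged on Python 2 and 3):

```python
# repr() of a float preserves enough digits that the value
# round-trips exactly through text and back:
x = 123456.78901234567
assert float(repr(x)) == x
```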

(3) xlrd builds Cell objects on the fly when demanded.

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
John Machin