Pandas - Writing an excel file containing unicode - IllegalCharacterError

Question

I have the following code:

import pandas as pd

x = [u'string with some unicode: \x16']
df = pd.DataFrame(x)

If I try to write this dataframe as an excel file:

df.to_excel("test.xlsx")

Or, if I try to write this dataframe as an excel file, with utf-8 encoding:

ew = pd.ExcelWriter('test.xlsx',options={'encoding':'utf-8'})
df.to_excel(ew)

I get the following error:

IllegalCharacterError                     Traceback (most recent call last)
<ipython-input-4-62adec25ae8d> in <module>()
      1 ew = pd.ExcelWriter('test.xlsx',options={'encoding':'utf-8'})
      2 #df.to_excel("test.xlsx")
----> 3 df.to_excel(ew)

/usr/local/lib/python2.7/dist-packages/pandas/util/decorators.pyc in wrapper(*args, **kwargs)
     86                 else:
     87                     kwargs[new_arg_name] = new_arg_value
---> 88             return func(*args, **kwargs)
     89         return wrapper
     90     return _deprecate_kwarg

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1258         formatted_cells = formatter.get_formatted_cells()
   1259         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1260                                  startrow=startrow, startcol=startcol)
   1261         if need_save:
   1262             excel_writer.save()

/usr/local/lib/python2.7/dist-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    679             colletter = get_column_letter(startcol + cell.col + 1)
    680             xcell = wks.cell("%s%s" % (colletter, startrow + cell.row + 1))
--> 681             xcell.value = _conv_value(cell.val)
    682             style_kwargs = {}
    683 

/usr/local/lib/python2.7/dist-packages/openpyxl/cell/cell.pyc in value(self, value)
    360     def value(self, value):
    361         """Set the value and infer type and display options."""
--> 362         self._bind_value(value)
    363 
    364     @property

/usr/local/lib/python2.7/dist-packages/openpyxl/cell/cell.pyc in _bind_value(self, value)
    269             elif self.guess_types:
    270                 value = self._infer_value(value)
--> 271         self.set_explicit_value(value, self.data_type)
    272 
    273 

/usr/local/lib/python2.7/dist-packages/openpyxl/cell/cell.pyc in set_explicit_value(self, value, data_type)
    235             raise ValueError('Invalid data type: %s' % data_type)
    236         if isinstance(value, STRING_TYPES):
--> 237             value = self.check_string(value)
    238         self._value = value
    239         self.data_type = data_type

/usr/local/lib/python2.7/dist-packages/openpyxl/cell/cell.pyc in check_string(self, value)
    220         value = value[:32767]
    221         if next(ILLEGAL_CHARACTERS_RE.finditer(value), None):
--> 222             raise IllegalCharacterError
    223         return value
    224 

IllegalCharacterError:

How can I write a pandas dataframe containing unicode to an excel file?

score 71 · Answer 1 · answered Jan 16 '19 at 09:02

71

The same problem happened to me. I solved it as follows:

First, install python package xlsxwriter:

pip install xlsxwriter

Second, replace the default engine 'openpyxl' with 'xlsxwriter':

df.to_excel("test.xlsx", engine='xlsxwriter')

answered Jan 16 '19 at 09:02

mathsyouth

3,770
1
20
18

3

Do you know why xlsxwriter manages to solve the problem? – Amr Keleg Mar 12 '22 at 10:45

score 10 · Answer 2 · edited Jan 26 '18 at 02:50

10

Use this to remove any error that you might be getting. You can save to excel post this.

df = df.applymap(lambda x: x.encode('unicode_escape').
                 decode('utf-8') if isinstance(x, str) else x)

edited Jan 26 '18 at 02:50

Stephen Rauch

47,830
31
106
135

answered Jan 26 '18 at 02:28

Krishna Aswani

181
2
13

3

well....it works, but the generated file content is something like "\u4ee5\u4e0b\u6587\u9577..." – allenyllee Sep 17 '20 at 03:31

bobince · Accepted Answer · 2015-03-06T07:28:31.703

6

Not a Unicode issue as such... \x16 (or in Unicode strings \u0016 refers to the same character) is ASCII control code 22 (SYN). Pandas says it's invalid to have control codes (other than tab and newlines) in an Excel file, and though I don't know much about Excel files it would certainly be impossible to include them in an XML 1.0 file, which is what's inside a xlsx.

So most likely there is no way to include arbitrary character sequences (with control codes) in an Excel. You should filter them out before writing, or if you really need to preserve the original data use some form of ad hoc encoding recognised only by your application.

edited Mar 06 '15 at 07:28

answered Mar 04 '15 at 14:14

bobince

528,062
107
651
834

Yes, the problem is at the XML level – these control characters cannot be serialised in XML and would be pretty meaningless in Excel anyway. – Charlie Clark Mar 05 '15 at 08:49

score 5 · Answer 4 · answered Sep 18 '20 at 06:46

I've answered a similar question at this post: https://stackoverflow.com/a/63950544/1851492, below is the same content.

If you don't want to install another excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which cause IllegalCharacterError raised.

Open cell.py which under the path /path/to/your/python/site-packages/openpyxl/cell/, look for check_string function, you'll see it using a defined regular expression pattern ILLEGAL_CHARACTERS_RE to find those illegal characters. Trying to locate its definition you'll see this line:

ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

This line is what you need to remove those characters. Copy this line to your program and execute below code before your dataframe is writing to excel:

dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)

The above line will apply remove those characters to every cells.

Louise · Answer 5 · 2015-03-03T17:58:45.990

0

I don't know this particular language, but generally there is an error with excel and UTF8. If you just open a file of UTF8 characters with excel programatically, it will corrupt them (it doesn't seem to handle all the bits in the character, but truncates it to effectively the first 2 and last 2 hex numbers of the 8 present in extended characters).

A work around, to load a utf file correctly into excel, is to get the program insert a macro into your excel sheet after you have loaded it which imports the data. I have some code to do this in C#, if that's any help?

does your input contain any extended characters (i.e. àâäçæèëéêìïîñòöôœûüùÿÀÂÄÇÆÈËÉÊÌÏÎÑÒÖÔŒÛÜÙŸ) and if you take them out, does it work?

edited Mar 03 '15 at 17:58

answered Mar 03 '15 at 16:46

Louise

382
1
6
16

Hi Louise, the error seems to occur with '\x16'. If I change my code to `x = [u'string with some unicode: \u03BB']` which is the lambda character, everything is fine. In fact any character starting with '\u' is fine, but any character starting with '\x' causes a problem. Perhaps characters starting with '\x' are not unicode? – killajoule Mar 03 '15 at 17:11
Don't know, I'm afraid. I just know that I had a similar problem with unicode to Excel, and was hoping to suggest a course of investigation. I had characters such as á which were coming out as Ã¡ in excel, and I discovered that the pattern was that say the unicode was c2ab cd2f the final character in excel was c22f and abcd, i.e. the order of the hex had been changed around or in some cases ignored. – Louise Mar 03 '15 at 17:35
Louise what you're describing is typical of UTF-8 being treated as Latin-1 or Windows 1252 (UTF-8 characters are two bytes, Latin-1 and Windows 1252 only one). If you open a data file from Excel it will always prompt you for a character encoding. – Charlie Clark Mar 05 '15 at 08:53

score 0 · Answer 6 · answered Nov 30 '18 at 20:04

When I encounter this error, I usually go around it by writing the file to a '.csv instead of '.xlsx' files. So instead of

yourdataframe.to_excel('Your workbook name.xlsx')

I would do:

yourdataframe.to_csv('Your workbook name.csv')

It appears the way pandas decodes .csv files by default is:

encoding : string, optional
A string representing the encoding to use in the output file,
defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.

On the other hand default encoding of .xlsx files is:

encoding: string, default None
encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.

This difference is responsible for that error. You will also get the error when you write data with strings that start with - or + to a .xlsx file.

score 0 · Answer 7 · answered Dec 10 '20 at 11:17

for writing a data frame containing unicode characters to multiple sheets in a single excel file below code can be helpful:

%pip install xlsxwriter
from pandas import ExcelWriter
import xlsxwriter
writer = ExcelWriter('notes.xlsx')
for key in dict_df:
        data[key].to_excel(writer, key,index=False,engine='xlsxwriter')
writer.save()

Pandas - Writing an excel file containing unicode - IllegalCharacterError

7 Answers7

Linked