How to remove illegal characters so a dataframe can write to Excel

Question

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:

openpyxl.utils.exceptions.IllegalCharacterError

I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!

Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.

What's the best way or removing or replacing illegal characters from a dataframe?

This is one is good example to solve the error: https://stackoverflow.com/questions/45312111/python-dataframe-illegal-character-error-into-ascii-codec-decode-error/50173208#50173208 — DeshDeep Singh, Feb 26 '19 at 15:52

score 48 · Accepted Answer · answered Feb 21 '17 at 10:49

48

Based on Haipeng Su's answer, I added a function that does this:

dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').
                 decode('utf-8') if isinstance(x, str) else x)

Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!

answered Feb 21 '17 at 10:49

user4896331

1,637
5
16
19

5

well....it works, but the generated file content is something like "\u4ee5\u4e0b\u6587\u9577..." – allenyllee Sep 17 '20 at 03:30

score 43 · Answer 2 · answered Jan 16 '19 at 09:06

43

The same problem happened to me. I solved it as follows:

install python package xlsxwriter:

pip install xlsxwriter

replace the default engine 'openpyxl' with 'xlsxwriter':

dataframe.to_excel("file.xlsx", engine='xlsxwriter')

answered Jan 16 '19 at 09:06

mathsyouth

3,770
1
20
18

2

I think this is a strong contender for what should be the accepted answer. – Ben Dalling Dec 13 '22 at 13:38

score 24 · Answer 3 · answered Jan 31 '18 at 16:08

24

try a different excel writer engine solved my problem.

writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')

answered Jan 31 '18 at 16:08

Jzou

1,225
3
11
22

Someone wrote "ES côté sick" in a field that comes back on one of my reports and Python was choking on it. I made the change suggested and it worked perfectly. – Eric Shreve Feb 12 '18 at 17:18
This is great, but how do you you use this engine and not overwrite the current file? – REdim.Learning Apr 04 '18 at 14:01
FFR, if you get a `No module named 'xlsxwriter'` error, you probably need to run `pip install xlsxwriter`, because `xlsxwriter` appears to be an optional dependency of pandas (meaning it won't be preinstalled) – Christian Reall-Fluharty Mar 19 '20 at 20:10

score 17 · Answer 4 · edited Oct 30 '20 at 02:57

17

If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError error to be raised.

Open cell.py which is found at /path/to/your/python/site-packages/openpyxl/cell/, look for check_string function, you'll see it is using a defined regular expression pattern ILLEGAL_CHARACTERS_RE to find those illegal characters. Trying to locate its definition you'll see this line:

ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:

dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)

The above line will remove those characters in every cell.

But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source Excel spreadsheets contains those characters, you will still face this problem. So if you can control the generation process of source spreadsheets, try to remove these characters there to begin with.

edited Oct 30 '20 at 02:57

MrVocabulary

597
1
11
31

answered Sep 18 '20 at 06:42

allenyllee

964
1
13
16

1

Wow the only thing that did actually work, i've been working on this issue for the last 3 days now. Thank you so much! – Dannark Sep 18 '20 at 15:02
How can I replace any `ILLEGAL_CHARACTERS` with a specific string? – Rami Alloush Dec 10 '20 at 01:20
1

@Rami Alloush `dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(STRING_YOU_WANT, x) if isinstance(x, str) else x)` change the `STRING_YOU_WANT` – allenyllee Dec 10 '20 at 02:57
2

Nice solution. Rather than redefining `ILLEGAL_CHARACTERS_RE`, you can just import it from openpyxl like `from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE`. – Daniel Himmelstein Nov 10 '21 at 21:54
@Daniel Himmelstein That seems like better practice and will help cover an instance where they redefine that – yem Feb 27 '23 at 07:10

score 5 · Answer 5 · answered Feb 17 '17 at 21:16

I was also struggling with some weird characters in a data frame when writing the data frame to html or csv. For example, for characters with accent, I can't write to html file, so I need to convert the characters into characters without the accent.

My method may not be the best, but it helps me to convert unicode string into ascii compatible.

# install unidecode first 
from unidecode import unidecode

def FormatString(s):
if isinstance(s, unicode):
  try:
    s.encode('ascii')
    return s
  except:
    return unidecode(s)
else:
  return s

df2 = df1.applymap(FormatString)

In your situation, if you just want to get rid of the illegal characters by changing return unidecode(s) to return 'StringYouWantToReplace'.

Hope this can give me some ideas to deal with your problems.

Thanks! That was really helpful. I used something similar based on your recommendation. — user4896331, Feb 21 '17 at 10:45

score 1 · Answer 6 · answered Nov 30 '20 at 12:10

1

You can use built-in strip() method for python strings.

for each cell:

text = str(illegal_text).strip()

for entire data frame:

dataframe = dataframe.applymap(lambda t: str(t).strip())

answered Nov 30 '20 at 12:10

Hosein Basafa

1,068
8
11

score 0 · Answer 7 · answered Apr 04 '18 at 16:30

If you're still struggling to clean up the characters, this worked well for me:

import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

score 0 · Answer 8 · answered Mar 23 '23 at 02:26

I had your same issue. I needed to loop through a folder and copy data from .csv, then paste into an .xlsx with the same file name. I needed to overwrite the data, not create a new tab.

The try/catch solved my issues with the IllegalCharacter error

import os
import pandas as pd
import openpyxl.utils.exceptions as xlerr

# Define the folder path containing the CSV files
csv_folder = (r"C:\folderpath")


# Loop through each CSV file in the folder
for filename in os.listdir(csv_folder):
    if filename.endswith(".csv"):

# Define the output Excel file path
        output_file = 'C:\Temp/'+filename.split('.')[0]+'.xlsx'

        # Read the CSV file into a pandas dataframe
        csv_path = os.path.join(csv_folder, filename)
        df = pd.read_csv(csv_path,encoding='cp1252',header=None)
    
        # Write the selected data to the output Excel file
        try:
            with pd.ExcelWriter(output_file, mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
                selected_data.to_excel(writer, sheet_name=filename.split('.')[0]+"", startrow=1, index=False, header=False)
        except xlerr.IllegalCharacterError as e:
            print(f"Error: {e}")

How to remove illegal characters so a dataframe can write to Excel

8 Answers8

Linked