
I am loading data into a pandas dataframe from an Excel sheet, and many columns contain non-display characters that I want to convert.

The most prevalent is the apostrophe used in a contraction; e.g. "doesn't" comes out as "doesn’t".

In the past I have used:

str.encode('ascii', errors='ignore').decode('utf-8')

but this required me to know which columns I needed to fix.

In this case I have 103 columns, each of which could contain this or similar issues.

I am looking for a way to just replace any and all issues across the entire dataframe.

Is there a quick and easy way to do this over the entire dataframe without having to pass each column to a function?

Littm
Justin

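One way to do what the question asks, without naming columns, is to apply the asker's own encode/decode trick to every cell of every string column at once. A minimal sketch (the dataframe below is a made-up stand-in for the one loaded from Excel):

```python
import pandas as pd

# Hypothetical dataframe standing in for the one loaded from Excel.
df = pd.DataFrame({
    "notes": ["doesn\u2019t work", "it\u2019s fine"],
    "count": [1, 2],
})

def to_ascii(value):
    # Drop any non-ASCII characters; non-strings pass through untouched.
    if isinstance(value, str):
        return value.encode("ascii", errors="ignore").decode("ascii")
    return value

# Apply the cleanup to every object (string) column in one go.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].applymap(to_ascii)

print(df["notes"].tolist())  # ['doesnt work', 'its fine']
```

Note that `errors='ignore'` removes the curly apostrophe entirely ("doesnt"); replacing it with a plain `'` via `str.replace` first would preserve the contraction.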
2 Answers


While reading the Excel file, you should add `encoding='utf-8'`:

df = pd.read_excel('App Stuff.xlsx', encoding='utf-8')

or use `encoding='unicode-escape'`.

NYC Coder
    I tried your solution but I got this error: `TypeError: read_excel() got an unexpected keyword argument 'encoding'` – Learner Sep 01 '21 at 13:51

Try to find an encoding that suits your file with:

import pandas as pd
from encodings.aliases import aliases

alias_values = set(aliases.values())

for value in alias_values:
    try:
        df = pd.read_csv(your_file, encoding=value)  # or pd.read_excel
        print(value)
    except Exception:
        continue

then open your file with each of the printed encodings and see which one works best!
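The probing loop above can be exercised on a small sample file; the byte string and temp file here are made up for illustration (0x92 is the Windows-1252 right single quotation mark, the same curly apostrophe the question describes):

```python
import os
import tempfile
from encodings.aliases import aliases

# "doesn't" with a cp1252 curly apostrophe (byte 0x92); invalid as UTF-8.
raw = b"doesn\x92t"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(raw)
    path = f.name

working = []
for value in set(aliases.values()):
    try:
        with open(path, encoding=value) as fh:
            fh.read()
        working.append(value)  # this encoding can decode the file
    except Exception:
        continue

os.unlink(path)

print("cp1252" in working)   # True
print(raw.decode("cp1252"))  # doesn’t (curly apostrophe restored)
```

Many permissive encodings (e.g. latin-1) will decode any byte stream without error, so "decodes successfully" is only a first filter; you still have to eyeball the output for the one that renders the text correctly.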

Ehsan