
I am loading data into a pandas dataframe from an Excel sheet, and many columns contain non-display characters that I want to convert.

The most prevalent is the apostrophe used in a contraction; e.g. "doesn't" comes out as "doesn’t".

In the past I have used:

str.encode('ascii', errors='ignore').decode('utf-8')

but this required me to know which columns I needed to fix.

In this case I have 103 columns, each of which could contain this or similar issues.

I am looking for a way to just replace any and all issues across the entire dataframe.

Is there a quick and easy way to do this over the entire dataframe without having to pass each column to a function?

Littm
Justin

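One way to do what the question asks, without naming columns, is to apply the asker's own encode/decode trick to every cell of every string column at once. A minimal sketch (the dataframe below is a made-up stand-in for the one loaded from Excel):

```python
import pandas as pd

# Hypothetical dataframe standing in for the one loaded from Excel.
df = pd.DataFrame({
    "notes": ["doesn\u2019t work", "it\u2019s fine"],
    "count": [1, 2],
})

def to_ascii(value):
    # Drop any non-ASCII characters; non-strings pass through untouched.
    if isinstance(value, str):
        return value.encode("ascii", errors="ignore").decode("ascii")
    return value

# Apply the cleanup to every object (string) column in one go.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].applymap(to_ascii)

print(df["notes"].tolist())  # ['doesnt work', 'its fine']
```

Note that `errors='ignore'` removes the curly apostrophe entirely ("doesnt"); replacing it with a plain `'` via `str.replace` first would preserve the contraction.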
2 Answers


While reading the Excel file, you should add `encoding='utf-8'`:

df = pd.read_excel('App Stuff.xlsx', encoding='utf-8')

or use `encoding='unicode-escape'`.

NYC Coder
    I tried your solution but I got this error: `TypeError: read_excel() got an unexpected keyword argument 'encoding'` – Learner Sep 01 '21 at 13:51

Try to find an encoding that suits your file with:

import pandas as pd
from encodings.aliases import aliases

alias_values = set(aliases.values())

for value in alias_values:
    try:
        df = pd.read_csv(your_file, encoding=value)  # or pd.read_excel
        print(value)
    except Exception:
        continue

then open your file with each of the printed encodings and see which one works best!
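The probing loop above can be exercised on a small sample file; the byte string and temp file here are made up for illustration (0x92 is the Windows-1252 right single quotation mark, the same curly apostrophe the question describes):

```python
import os
import tempfile
from encodings.aliases import aliases

# "doesn't" with a cp1252 curly apostrophe (byte 0x92); invalid as UTF-8.
raw = b"doesn\x92t"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(raw)
    path = f.name

working = []
for value in set(aliases.values()):
    try:
        with open(path, encoding=value) as fh:
            fh.read()
        working.append(value)  # this encoding can decode the file
    except Exception:
        continue

os.unlink(path)

print("cp1252" in working)   # True
print(raw.decode("cp1252"))  # doesn’t (curly apostrophe restored)
```

Many permissive encodings (e.g. latin-1) will decode any byte stream without error, so "decodes successfully" is only a first filter; you still have to eyeball the output for the one that renders the text correctly.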

Ehsan