Pandas DataFrame: remove � (unknown-character) from strings in rows

Question

I have read a csv file into python 2.7 (windows machine). Sales Price column seems to be mixture of string and float. And some rows contains a euro symbol €. Python sees € as �.

df = pd.read_csv('sales.csv', thousands=',')
print df

Gender  Size    Color   Category    Sales Price
Female  36-38   Blue    Socks       25
Female  44-46   Pink    Socks       13.2
Unisex  36-38   Black   Socks      � 19.00
Unisex  40-42   Pink    Socks      � 18.50
Female  38      Yellow  Pants      � 89,00
Female  43      Black   Pants      � 89,00

I was under the assumption that a simple line with replace will solve it

df=df.replace('\�','',regex=True).astype(float)

But I got encoding error

SyntaxError: Non-ASCII character

Would appreciate hearing your thoughts on this

How does work `df = pd.read_csv('sales.csv', thousands=',', encoding='utf-8')` or `df = pd.read_csv('sales.csv', thousands=',', encoding='latin')` ? — jezrael, Jul 13 '17 at 10:08
Try finding the encoding with `chardet` utility and then specify it in pandas params. — Sergey Bushmanov, Jul 13 '17 at 10:15

score 4 · Answer 1 · answered Jul 27 '17 at 06:27

I faced a similar problem where one of the column in my dataframe had lots of currency symbols. Euro, Dollar, Yen, Pound etc. I tried multiple solutions but the easiest one was to use unicodedata module.

df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')

The above will replace € with Euro in Sales Price column.

score 3 · Answer 2 · answered Jul 13 '17 at 11:48

I think @jezrael comment is valid. First you need to read the file with encoding(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html under encoding section)

df=pd.read_csv('sales.csv', thousands=',', encoding='utf-8')

but for replacing Euro sign try this:

df=df.replace('\u20AC','',regex=True).astype(float)

Pandas DataFrame: remove � (unknown-character) from strings in rows

2 Answers2