How do I remove non-ASCII characters (e.g. Ð±§•¿µ´‡»Ž®ºÏƒ¶¹) from text in pandas DataFrame columns?

I have tried the following, but with no luck:

df = pd.read_csv(path, index_col=0)
for col in df.columns:
    for j in df.index:
        markup1 = str(df.ix[j, col]).replace("\r", "")
        markup1 = markup1.replace("\n", "")
        markup1 = markup1.decode('unicode_escape').encode('ascii','ignore').strip()
        soup = BeautifulSoup(markup1, 'lxml')
        df.ix[j, col] = soup.get_text()
        print df.ix[j, 'requirements']

I also tried a regex, but that did not work either:

markup1 = str(df.ix[j, 'requirements']).replace("\r", "")
markup1 = markup1.replace("\n", "")
markup1 = re.sub(r'[^\x00-\x7F]+', ' ', markup1)

I still keep getting the non-ascii characters. Any suggestion would be appreciated.

I have added the first three rows of the df below:

                                              col1               col2  \
1.0                          H1B SPONSOR FOR L1/L2/OPT  US, NY, New York
2.0                             Graphic / Web Designer     US, TX, Austin
3.0  Full Stack Developer (.NET or equivalent + Jav...             GR, ,

                col3  col4  \
1.0                  NaN   NaN
2.0  Sales and Marketing   NaN
3.0                  NaN   NaN

                                              col5  \ 
1.0  i28 Technologies has demonstrated expertise in...
2.0  outstanding people who believe that more is po...
3.0                                                NaN

                                              col6  \
1.0  Hello,Wish you are doing good...              ...
2.0  The Graphic / Web Designer will manage, popula...
3.0   You?ll have to join the Moosend dojo. But, yo...

                                              col7  \
1.0  JAVA, .NET, SQL, ORACLE, SAP, Informatica, Big...
2.0  Bachelor?s degree in Graphic Design, Web Desig...
3.0  ? .NET or equivalent (Java etc.)? MVC? Javascr...

                                              col8 col9
1.0                                                NaN    f
2.0  CSD offers a competitive benefits package for ...    f
3.0  You?ll be working with the best team in town.....    f
Ola O

2 Answers

Option 1 - if you know the complete set of non-ascii characters:

df
Out[36]: 
         col1  col2
0  aaÐ±§•¿µbb  abcd
1         hf4  efgh
2         xxx  ijk9

df.replace(regex=True, to_replace=['Ð', '§', '±'], value='') # incomplete here
Out[37]: 
      col1  col2
0  aa•¿µbb  abcd
1      hf4  efgh
2      xxx  ijk9
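
If the list of unwanted characters grows, one possible variation is to collapse it into a single regex character class and escape each entry, since with regex=True a character such as '|' would otherwise be read as a regex operator. This is only a sketch, assuming Python 3 and a reasonably recent pandas; the sample frame and the names unwanted, pattern and cleaned are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data mirroring the example above.
df = pd.DataFrame({"col1": ["aaÐ±§•¿µbb", "hf4", "xxx"],
                   "col2": ["abcd", "efgh", "ijk9"]})

# Characters to strip; '|' is included to show why escaping matters.
unwanted = ["Ð", "§", "±", "•", "¿", "µ", "|"]

# Build one character class, escaping each character for regex safety.
pattern = "[" + "".join(re.escape(ch) for ch in unwanted) + "]"

# replace() returns a new DataFrame; the original df is left unchanged.
cleaned = df.replace(to_replace=pattern, value="", regex=True)
print(cleaned)
```

Any character missing from the list is left in place, so this only helps when the set really is known in advance.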

Option 2 - if you can't specify the whole set of non-ascii characters:

Consider using string.printable:

String of ASCII characters which are considered printable.

from string import printable

printable
Out[38]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

df.applymap(lambda y: ''.join(filter(lambda x: 
            x in printable, y)))
Out[14]: 
   col1  col2
0  aabb  abcd
1   hf4  asdf
2   xxx      

Note that if an element in the DataFrame is all-non-ascii, it will be replaced with just ''.
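
The lambda above assumes every cell is a string, so a frame with NaN cells (like the one in the question) would raise a TypeError. A possible workaround, sketched here under the assumption of Python 3, is a small helper that passes non-string values through and drops non-ASCII characters with encode/decode. The helper name strip_non_ascii and the sample data are hypothetical. Unlike string.printable, this keeps ASCII control characters.

```python
import pandas as pd

def strip_non_ascii(value):
    """Drop non-ASCII characters from strings; return other values unchanged."""
    if not isinstance(value, str):
        return value  # NaN, None, and numbers pass through untouched
    return value.encode("ascii", "ignore").decode("ascii")

# Hypothetical sample data with a missing value and a curly apostrophe.
df = pd.DataFrame({"col1": ["aaÐ±§•¿µbb", "hf4", None],
                   "col2": ["You’ll join", "efgh", "ijk9"]})

# Series.map applies the helper element by element, one column at a time.
for col in df.columns:
    df[col] = df[col].map(strip_non_ascii)

print(df)
```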

Brad Solomon
  • Thanks for the comment. I tried Option 1 as follows, but the non-ASCII characters are still in the dataframe. `df.replace(regex=True, to_replace =['¢','€','£', 'Ã', '¬', 'Ð', '±','½','©','•','¾', '§', '¥', '«', '¤', '–', 'œ', '¡', '”', '|', 'â', '™', 'Â', 'Î', '¿', 'µ', '´', '‡','»', 'Ž', '®', 'º', 'Ï', 'ƒ', '¶', '¹', '┬', 'á', 'Γ', 'Ç', 'Ö'], value='', inplace=True)` – Ola O May 31 '17 at 01:57
  • That's strange. When I use that exact operation on an example `df` with non-ascii characters, it returns `df` with them removed. What are the dtypes? Also, I'm in Python 3.5, but I don't see why that would have an effect. – Brad Solomon May 31 '17 at 02:20
  • I use Python 2.7. The dtypes are object. – Ola O May 31 '17 at 03:19
  • Post a snippet of what df looks like in your question. – Brad Solomon May 31 '17 at 03:20
  • The df looks like the snippet now added to the question, but I have renamed the columns. – Ola O May 31 '17 at 22:58
  • Sorry I can't help further, I'm stumped on this one – Brad Solomon Jun 01 '17 at 11:16

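Inspired by Brad's answer, I solved the problem with a whitelist of ASCII codes: space (32), comma (44), period (46), and the letters A-Z (65-90) and a-z (97-122). Note that digits (48-57) are not in the list, so they are removed as well.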

def remove_non_ascii(text):
    # Allowed codes: space, comma, period, A-Z, a-z
    L = [32, 44, 46] + list(range(65, 91)) + list(range(97, 123))
    text = str(text)

    # Keep only the characters whose code is in the whitelist
    return ''.join(i for i in text if ord(i) in L)
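
One thing to watch with the function above is that str(text) turns a missing value into the literal string 'nan'. A variation that leaves non-strings alone might look like the sketch below, assuming Python 3. The names ALLOWED and keep_allowed are hypothetical, and the digits are an addition that the list above does not include.

```python
import string

# Assumed whitelist: letters, digits, space, comma, and period.
# Digits are an addition here; drop string.digits to match the list above.
ALLOWED = frozenset(string.ascii_letters + string.digits + " ,.")

def keep_allowed(text):
    """Keep only whitelisted characters; leave non-string values untouched."""
    if not isinstance(text, str):
        return text  # e.g. NaN stays NaN instead of becoming the string 'nan'
    return "".join(ch for ch in text if ch in ALLOWED)

print(keep_allowed("You’ll be working with the best team in town."))
```

It could then be applied per column with something like df[col] = df[col].map(keep_allowed).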
Ola O