37

I have been trying to work on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:


+------------------------------------------+--------+-------+
| DB_user                                  | source | count |
+------------------------------------------+--------+-------+
| ???/"Ò|Z?)?]??C %??J                     | A      | 10    |
| ?D$ZGU   ;@D??_???T(?)                   | B      | 3     |
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_??         | C      | 2    |
+------------------------------------------+--------+-------+

I was using this function, which I came across while researching the problem on SO:

def filter_func(string):
    for i in range(0, len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
    return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func, axis=1)

I keep getting the error:


'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by looping over the string in filter_func I was passing a single character into ord. The moment it hit a non-ASCII character, it should have been replaced by a space.

Could somebody help me out?

Thanks!

cs95
red_devil

8 Answers

43

You may try this:

df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
MaxU - stand with Ukraine

  • great answer, this can also be used for the entire DataFrame. – elPastor Aug 30 '17 at 11:29
  • This performs a slightly different task than the one illustrated in the question — it accepts all ASCII characters, whereas the sample code in the question rejects non-printable characters by starting at character 32 rather than 0. The characters `\x00` can be replaced with a single space to make this answer match the accepted answer in its behavior. – Brandon Rhodes Feb 13 '20 at 16:17
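To illustrate the point in the last comment, here is a quick sketch (the sample string is made up) contrasting the two patterns: keeping all ASCII versus keeping only printable ASCII.

```python
import pandas as pd

df = pd.DataFrame({'DB_user': ['ab\x01Òc']})

# keep all ASCII (0x00-0x7F), as in the answer above
all_ascii = df.DB_user.replace({r'[^\x00-\x7F]+': ''}, regex=True)

# keep only printable ASCII (0x20-0x7F), per the comment
printable_ascii = df.DB_user.replace({r'[^\x20-\x7F]+': ''}, regex=True)

print(repr(all_ascii.iloc[0]))        # the \x01 control character survives
print(repr(printable_ascii.iloc[0]))  # 'abc'
```

With the first pattern only the non-ASCII `Ò` is removed; with the second, the control character goes too.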
36

A common trick is to encode to ASCII with the errors="ignore" flag, then decode the bytes back to a string:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

On Python 3, this is my recommended solution.


Minimal Code Sample

s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s

0      Déjà vu
1         Ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that do not belong to any character encoding scheme (not just ASCII).
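For example, here is a hypothetical variant of the same recipe (not from the original answer) that keeps everything representable in Latin-1 rather than plain ASCII:

```python
import pandas as pd

s = pd.Series(['Déjà vu 日本'])

# keep Latin-1 characters (accents survive); everything else is dropped
out = s.str.encode('latin-1', 'ignore').str.decode('latin-1')
print(out.iloc[0])  # 'Déjà vu ' (the CJK characters are gone)
```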

cs95
13

Your code fails because you are not applying it to each character: you are applying it per string, and ord errors because it takes a single character. You would need:

df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison (note the bounds are inclusive):

''.join([i if 32 <= ord(i) <= 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))

The fastest is to use translate (this is Python 2; on Python 3 use str.maketrans):

from string import maketrans

del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

df['DB_user'] = df["DB_user"].str.translate(trans)
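On Python 3 the same idea can also be written with str.translate and a custom lookup table: translate accepts any object with __getitem__, and raising LookupError leaves the character unchanged. This is a sketch of that approach, not part of the original answer:

```python
import pandas as pd

class NonAsciiToSpace:
    """Lookup table for str.translate: map anything outside
    printable ASCII (32..126) to a space, keep the rest."""
    def __getitem__(self, codepoint):
        if codepoint < 32 or codepoint > 126:
            return ' '
        raise LookupError  # character is kept unchanged

table = NonAsciiToSpace()  # build once, reuse per row
df = pd.DataFrame({'DB_user': ['Ò|zz', ';test 123']})
df['DB_user'] = df['DB_user'].apply(lambda s: s.translate(table))
print(df['DB_user'].tolist())  # [' |zz', ';test 123']
```

Unlike a str.maketrans table built from chr(0)..chr(255), this also catches code points above 255.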
Padraic Cunningham
3

A couple of the answers given here aren't correct. Simple validation:

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> '\x0b'  # FAIL
del_chars =  " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. There were some Python errors/typos in his second snippet, amended here, but otherwise it should be the fastest.
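The validation above can be made exhaustive with a one-line check that no character outside 32..126 survives (a sketch following the same str.maketrans recipe):

```python
import pandas as pd

# a single string containing every code point 0..255
s = pd.Series(["".join(chr(i) for i in range(256))])

del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))
out = s.apply(lambda x: x.translate(trans)).iloc[0]

# every surviving character is printable ASCII
assert set(out) <= {chr(i) for i in range(32, 127)}
print("all 256 code points handled correctly")
```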

Josh Friedlander
0

This worked for me. Since the series has some NaN values, it operates only on strings:

from string import printable

import pandas as pd

df["text_data"] = df["text_data"].str.split().str.join(' ')

df["text_data"] = df["text_data"].apply(
    lambda s: ''.join(filter(lambda c: c in printable, s)) if isinstance(s, str) else s
)
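A related trick (my sketch, not from the answer): pandas .str methods skip missing values automatically, so the isinstance guard can be avoided with .str.replace:

```python
import pandas as pd

df = pd.DataFrame({'text_data': ['ab\x01cÒd', None]})

# runs of non-printable/non-ASCII characters become a single space;
# the missing value passes through as NaN
df['text_data'] = df['text_data'].str.replace(r'[^ -~]+', ' ', regex=True)
print(df['text_data'].tolist())
```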
cavalcantelucas
0
from string import printable

def printable_mapper(x): 
    return ''.join([_ if _ in printable else " " for _ in x])

df.DB_user = df.DB_user.map(printable_mapper)
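One caveat with this approach (my observation, not the answer's): string.printable ends with the whitespace set, so tabs, newlines, and vertical tabs survive the filter:

```python
from string import printable

# the last six characters of printable are string.whitespace
print(repr(printable[-6:]))  # ' \t\n\r\x0b\x0c'

def printable_mapper(x):
    return ''.join([c if c in printable else " " for c in x])

# '\x0b' (vertical tab) counts as printable and is kept
print(repr(printable_mapper('a\x0bb')))  # 'a\x0bb'
```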
Idr
0

Here is a one-liner that I use:

df = df.replace(to_replace=r"[^ -~]+", value="", regex=True)

Using a regex, it removes every character not in the range ' ' (space) to '~'. Note that pandas uses Python regular expressions, so JavaScript-style `/.../g` delimiters must not be included in the pattern (they would be matched literally).
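A quick sketch of that one-liner on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'DB_user': ['abc\x7fdef', 'Ò|zz']})
# remove runs of characters outside the ' '..'~' (0x20..0x7E) range
df = df.replace(to_replace=r"[^ -~]+", value="", regex=True)
print(df['DB_user'].tolist())  # ['abcdef', '|zz']
```

Note that DEL (`\x7f`) falls just outside `~` (`\x7e`), so it is removed too.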

Tony Shouse
-1

This worked for me:

import re

def replace_foreign_characters(s):
    return re.sub(r'[^\x00-\x7f]', r'', s)

df['column_name'] = df['column_name'].apply(replace_foreign_characters)
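Since the question asks for spaces rather than removal, here is a variant (my sketch) that substitutes a space and precompiles the pattern:

```python
import re
import pandas as pd

non_ascii = re.compile(r'[^\x00-\x7f]')  # compiled once, reused for every row

def replace_foreign_characters(s):
    return non_ascii.sub(' ', s)

df = pd.DataFrame({'column_name': ['aÒb']})
df['column_name'] = df['column_name'].apply(replace_foreign_characters)
print(df['column_name'].iloc[0])  # 'a b'
```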
Justin Malinchak