37

I have been trying to work on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:


+------------------------------------------+--------+-------+
| DB_user                                  | source | count |
+------------------------------------------+--------+-------+
| ???/"Ò|Z?)?]??C %??J                     | A      | 10    |
| ?D$ZGU   ;@D??_???T(?)                   | B      | 3     |
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_??         | C      | 2    |
+------------------------------------------+--------+-------+

I was using this function, which I came across while researching the problem on SO:

def filter_func(string):
    for i in range(0, len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
    return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func, axis=1)

I keep getting the error:


'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by looping over the string in filter_func I was passing a single character into ord. The moment it hit a non-ASCII character, it should have been replaced by a space.

Could somebody help me out?

Thanks!

cs95
red_devil

8 Answers

43

You may try this:

df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
MaxU - stand with Ukraine

  • great answer, this can also be used for the entire DataFrame. – elPastor Aug 30 '17 at 11:29
  • This performs a slightly different task than the one illustrated in the question — it accepts all ASCII characters, whereas the sample code in the question rejects non-printable characters by starting at character 32 rather than 0. The characters `\x00` can be replaced with a single space to make this answer match the accepted answer in its behavior. – Brandon Rhodes Feb 13 '20 at 16:17
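To illustrate the point in the last comment, here is a quick sketch (the sample string is made up) contrasting the two patterns: keeping all ASCII versus keeping only printable ASCII.

```python
import pandas as pd

df = pd.DataFrame({'DB_user': ['ab\x01Òc']})

# keep all ASCII (0x00-0x7F), as in the answer above
all_ascii = df.DB_user.replace({r'[^\x00-\x7F]+': ''}, regex=True)

# keep only printable ASCII (0x20-0x7F), per the comment
printable_ascii = df.DB_user.replace({r'[^\x20-\x7F]+': ''}, regex=True)

print(repr(all_ascii.iloc[0]))        # the \x01 control character survives
print(repr(printable_ascii.iloc[0]))  # 'abc'
```

With the first pattern only the non-ASCII `Ò` is removed; with the second, the control character goes too.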
36

A common trick is to encode to ASCII with the errors="ignore" flag, then decode the bytes back to a string:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

On Python 3, this is my recommended solution.


Minimal Code Sample

s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s

0      Déjà vu
1         Ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that do not belong to any character encoding scheme (not just ASCII).
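For example, here is a hypothetical variant of the same recipe (not from the original answer) that keeps everything representable in Latin-1 rather than plain ASCII:

```python
import pandas as pd

s = pd.Series(['Déjà vu 日本'])

# keep Latin-1 characters (accents survive); everything else is dropped
out = s.str.encode('latin-1', 'ignore').str.decode('latin-1')
print(out.iloc[0])  # 'Déjà vu ' (the CJK characters are gone)
```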

cs95
13

Your code fails because you are not applying it to each character: you are applying it per string, and ord errors because it takes a single character. You would need:

df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison (note the bounds are inclusive):

''.join([i if 32 <= ord(i) <= 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))

The fastest is to use translate (this is Python 2; on Python 3 use str.maketrans):

from string import maketrans

del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

df['DB_user'] = df["DB_user"].str.translate(trans)
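On Python 3 the same idea can also be written with str.translate and a custom lookup table: translate accepts any object with __getitem__, and raising LookupError leaves the character unchanged. This is a sketch of that approach, not part of the original answer:

```python
import pandas as pd

class NonAsciiToSpace:
    """Lookup table for str.translate: map anything outside
    printable ASCII (32..126) to a space, keep the rest."""
    def __getitem__(self, codepoint):
        if codepoint < 32 or codepoint > 126:
            return ' '
        raise LookupError  # character is kept unchanged

table = NonAsciiToSpace()  # build once, reuse per row
df = pd.DataFrame({'DB_user': ['Ò|zz', ';test 123']})
df['DB_user'] = df['DB_user'].apply(lambda s: s.translate(table))
print(df['DB_user'].tolist())  # [' |zz', ';test 123']
```

Unlike a str.maketrans table built from chr(0)..chr(255), this also catches code points above 255.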
Padraic Cunningham
3

A couple of the answers given here aren't correct. Simple validation:

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> '\x0b'  # FAIL
del_chars =  " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. There were some Python errors/typos in his second snippet, amended here, but otherwise it should be the fastest.
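The validation above can be made exhaustive with a one-line check that no character outside 32..126 survives (a sketch following the same str.maketrans recipe):

```python
import pandas as pd

# a single string containing every code point 0..255
s = pd.Series(["".join(chr(i) for i in range(256))])

del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))
out = s.apply(lambda x: x.translate(trans)).iloc[0]

# every surviving character is printable ASCII
assert set(out) <= {chr(i) for i in range(32, 127)}
print("all 256 code points handled correctly")
```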

Josh Friedlander
0

This worked for me. Since the series has some NaN values, it operates only on strings:

from string import printable

import pandas as pd

df["text_data"] = df["text_data"].str.split().str.join(' ')

df["text_data"] = df["text_data"].apply(
    lambda s: ''.join(filter(lambda c: c in printable, s)) if isinstance(s, str) else s
)
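A related trick (my sketch, not from the answer): pandas .str methods skip missing values automatically, so the isinstance guard can be avoided with .str.replace:

```python
import pandas as pd

df = pd.DataFrame({'text_data': ['ab\x01cÒd', None]})

# runs of non-printable/non-ASCII characters become a single space;
# the missing value passes through as NaN
df['text_data'] = df['text_data'].str.replace(r'[^ -~]+', ' ', regex=True)
print(df['text_data'].tolist())
```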
cavalcantelucas
0
from string import printable

def printable_mapper(x): 
    return ''.join([_ if _ in printable else " " for _ in x])

df.DB_user = df.DB_user.map(printable_mapper)
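One caveat with this approach (my observation, not the answer's): string.printable ends with the whitespace set, so tabs, newlines, and vertical tabs survive the filter:

```python
from string import printable

# the last six characters of printable are string.whitespace
print(repr(printable[-6:]))  # ' \t\n\r\x0b\x0c'

def printable_mapper(x):
    return ''.join([c if c in printable else " " for c in x])

# '\x0b' (vertical tab) counts as printable and is kept
print(repr(printable_mapper('a\x0bb')))  # 'a\x0bb'
```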
Idr
0

Here is a one-liner that I use:

df = df.replace(to_replace=r"[^ -~]+", value="", regex=True)

Using a regex, it removes every character not in the range ' ' (space) to '~'. Note that pandas uses Python regular expressions, so JavaScript-style `/.../g` delimiters must not be included in the pattern (they would be matched literally).
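A quick sketch of that one-liner on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'DB_user': ['abc\x7fdef', 'Ò|zz']})
# remove runs of characters outside the ' '..'~' (0x20..0x7E) range
df = df.replace(to_replace=r"[^ -~]+", value="", regex=True)
print(df['DB_user'].tolist())  # ['abcdef', '|zz']
```

Note that DEL (`\x7f`) falls just outside `~` (`\x7e`), so it is removed too.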

Tony Shouse
-1

This worked for me:

import re

def replace_foreign_characters(s):
    return re.sub(r'[^\x00-\x7f]', r'', s)

df['column_name'] = df['column_name'].apply(replace_foreign_characters)
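Since the question asks for spaces rather than removal, here is a variant (my sketch) that substitutes a space and precompiles the pattern:

```python
import re
import pandas as pd

non_ascii = re.compile(r'[^\x00-\x7f]')  # compiled once, reused for every row

def replace_foreign_characters(s):
    return non_ascii.sub(' ', s)

df = pd.DataFrame({'column_name': ['aÒb']})
df['column_name'] = df['column_name'].apply(replace_foreign_characters)
print(df['column_name'].iloc[0])  # 'a b'
```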
Justin Malinchak