How to remove non UTF-8 characters from Pandas columns

Asked Jun 24 '19 at 22:35

Active Jun 24 '19 at 22:35

Viewed 2,824 times

This is a follow-up to this question, which shows how to remove non-ASCII characters from Pandas columns:

 df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

According to the Wikipedia article on UTF-8, UTF-8 is

The first 128 characters of Unicode

So my guess is that the solution would be

 df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) > 127 else i for i in x]))

SantoshGupta7

Try using the str methods instead: `df["DB_user"].str.encode('utf-8', 'ignore').str.decode('utf-8')`. – cs95 Jun 24 '19 at 22:39
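
A rough sketch of what this suggestion does, assuming made-up sample data and `dtype=object` (chosen here only to keep the behavior the same across pandas versions). Python strings are already Unicode, so the round trip leaves ordinary non-ASCII text such as `é` untouched; with `'ignore'` it only drops code points that cannot be encoded as UTF-8, such as lone surrogates.

```python
import pandas as pd

# Hypothetical sample data: plain ASCII, a non-ASCII letter (é), and a
# lone surrogate (\udc80), which cannot be encoded as UTF-8.
values = pd.Series(["abc", "caf\u00e9", "bad\udc80byte"], dtype=object)
df = pd.DataFrame({"DB_user": values})

# Encode to UTF-8 bytes, silently dropping anything unencodable,
# then decode the bytes back to text.
cleaned = df["DB_user"].str.encode("utf-8", "ignore").str.decode("utf-8")

result = cleaned.tolist()
# result == ["abc", "café", "badbyte"]
```

Only the surrogate is removed here; `é` survives because it is valid UTF-8.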

Also, you're probably misunderstanding what the Wikipedia article says. It mentions the first 128 characters of Unicode to make the point that "valid ASCII text is valid UTF-8-encoded Unicode as well." UTF-8 supports far more than 128 characters. – cs95 Jun 24 '19 at 22:42
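
To illustrate the point about UTF-8 covering far more than 128 characters, here is a minimal plain-Python sketch. The sample characters and variable names are chosen for illustration only. It shows that characters beyond code point 127 encode to valid multi-byte UTF-8, and that a filter on `ord(i) > 127` removes non-ASCII characters rather than non-UTF-8 ones.

```python
# Characters from different Unicode ranges: A, é, €, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies when encoded as UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
# byte_lengths == [1, 2, 3, 4]

# Every sample survives a UTF-8 round trip unchanged.
round_trip_ok = all(ch.encode("utf-8").decode("utf-8") == ch for ch in samples)

# Pure ASCII bytes decode to the same text under either codec.
ascii_bytes = b"DB_user"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# The ord(i) > 127 filter from the question blanks out everything
# except the ASCII letter, even though all four are valid UTF-8.
filtered = "".join(" " if ord(ch) > 127 else ch for ch in samples)
# filtered == "A   "
```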
