
I have a column with addresses and want to find all rows that contain 'foreign', i.e. non-ASCII, characters.

import pandas as pd

df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.", "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"]
})

The expected output is only the 2nd row, which contains the "ø" character.

I tried this:

df['column_name'].str.isascii()

but this does not work, at least in Python 3.6.

In MySQL I can do the equivalent

SELECT * FROM `table_name` WHERE `column_name`!=CONVERT(`column_name` USING ASCII)

which works.
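For reference, here is a sketch of the kind of filter I am after in pandas, using a regex for any character outside the ASCII range (just my attempt, not necessarily the idiomatic way):

# Sketch: flag rows whose column_name contains any character outside ASCII.
# The character class [^\x00-\x7F] matches anything beyond the 7-bit range.
non_ascii = df['column_name'].str.contains(r'[^\x00-\x7F]', regex=True)
print(df[non_ascii])   # expect only the Sønderborg row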

Pranab
    @anky_91 Great suggestion. I added a test case dataframe with one row that should match and one that shouldn't. – Pranab Jan 21 '20 at 03:21

2 Answers


Not sure about the performance, but you can try to encode the string while ignoring errors, then compare the length of the encoded string to that of the original, like:

mask_nonAscii = df.column_name.map(lambda x: len(x.encode('ascii', errors='ignore')) != len(x))
print(df[mask_nonAscii])
                                    column_name column_other
1  Møllegade 1234567 DK-6400 Sønderborg Denmark            1
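
This works because encoding with errors='ignore' silently drops every character that cannot be represented in ASCII, so the encoded result comes out shorter exactly when the original contains a non-ASCII character. A quick plain-Python illustration:

s = "Møllegade"
print(len(s))                                   # 9 characters
print(len(s.encode('ascii', errors='ignore')))  # 8 bytes, the 'ø' was dropped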

EDIT: thanks to a comment from anky_91, you can also use pandas' str.len and str.encode, like:

mask_nonAscii = df['column_name'].str.len()\
                 .ne(df['column_name'].str.encode('ascii',errors = 'ignore').str.len())

at least the comparison between the lengths is vectorized this way.


Also, I have not seen any documentation for a pandas.Series.str.isascii method, but if you use Python 3.7+, you can create the mask with the built-in str.isascii this way:

mask_nonAscii = ~df.column_name.map(str.isascii)
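
Applied to the example dataframe (assuming Python 3.7+, where str.isascii exists), this should select the same row:

print(df[mask_nonAscii])   # same single row as above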
Ben.T

In this example, the dataframe is named data.

  1. First, make a list of the columns with string datatype

    cols = ["A", "B", "C"]

  2. Run the code below to loop through the columns and print, for each one, the number of values that contain non-ASCII characters

for col in cols:
    mask = data[col].str.len().ne(
        data[col].str.encode("ascii", errors="ignore").str.len())
    print(col, mask.sum())

A 0
B 3
C 0

In this example, column B has the non-ASCII values.

  3. Run the modified code below to build a boolean mask of the rows with non-ASCII values in that column

    data["B"].str.len().ne(data["B"].str.encode("ascii", errors="ignore").str.len())

tejkweku