
I have a column with addresses and want to find all rows that contain 'foreign', i.e. non-ASCII, characters.

import pandas as pd

df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.", "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"]
})

The expected output is only the 2nd row, which contains the "ø" character.

I tried this:

df['column_name'].str.isascii()

but this does not work, at least in Python 3.6.

In MySQL I can do the equivalent

SELECT * FROM `table_name` WHERE `column_name`!=CONVERT(`column_name` USING ASCII)

which works.
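For reference, here is a sketch of the kind of filter I am after in pandas, using a regex for any character outside the ASCII range (just my attempt, not necessarily the idiomatic way):

# Sketch: flag rows whose column_name contains any character outside ASCII.
# The character class [^\x00-\x7F] matches anything beyond the 7-bit range.
non_ascii = df['column_name'].str.contains(r'[^\x00-\x7F]', regex=True)
print(df[non_ascii])   # expect only the Sønderborg row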

Pranab
    @anky_91 Great suggestion. I added a test case dataframe with one row that should match and one that shouldn't. – Pranab Jan 21 '20 at 03:21

2 Answers


Not sure about the performance, but you can try to encode the string while ignoring errors, then compare the length of the encoded string to that of the original, like:

mask_nonAscii = df.column_name.map(lambda x: len(x.encode('ascii', errors='ignore')) != len(x))
print(df[mask_nonAscii])
                                    column_name column_other
1  Møllegade 1234567 DK-6400 Sønderborg Denmark            1
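
This works because encoding with errors='ignore' silently drops every character that cannot be represented in ASCII, so the encoded result comes out shorter exactly when the original contains a non-ASCII character. A quick plain-Python illustration:

s = "Møllegade"
print(len(s))                                   # 9 characters
print(len(s.encode('ascii', errors='ignore')))  # 8 bytes, the 'ø' was dropped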

EDIT: thanks to a comment from anky_91, you can also use pandas' str.len and str.encode, like:

mask_nonAscii = df['column_name'].str.len()\
                 .ne(df['column_name'].str.encode('ascii',errors = 'ignore').str.len())

at least the comparison between the lengths is vectorized this way.


Also, I have not seen any documentation for a pandas.Series.str.isascii method, but if you use Python 3.7+, you can create the mask with the built-in str.isascii this way:

mask_nonAscii = ~df.column_name.map(str.isascii)
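
Applied to the example dataframe (assuming Python 3.7+, where str.isascii exists), this should select the same row:

print(df[mask_nonAscii])   # same single row as above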
Ben.T

In this example, the dataframe is named data.

  1. First, make a list of the columns with string datatype

    cols = ["A", "B", "C"]

  2. Run the code below to loop through the columns and print, for each one, the number of values that contain non-ASCII characters

for col in cols:
    mask = data[col].str.len().ne(
        data[col].str.encode("ascii", errors="ignore").str.len())
    print(col, mask.sum())

A 0
B 3
C 0

In this example, column B has the non-ASCII values.

  3. Run the modified code below to build a boolean mask of the rows with non-ASCII values in that column

    data["B"].str.len().ne(data["B"].str.encode("ascii", errors="ignore").str.len())

tejkweku