
I have a PySpark dataframe with a string column title. I need to find all the rows whose title contains any of the words ['Cars','Car','Vehicle','Vehicles'], matched as whole words. One way to do this is:

# Chain one LIKE clause per search word
filter_1 = "title like '%{}%' or title like '%{}%' or title like '%{}%' or title like '%{}%'"\
    .format('Car', 'Cars', 'Vehicle', 'Vehicles')

df1 = df.filter(filter_1).select('id', 'title')
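
For reference, here is a minimal example of the kind of data involved (the sample rows below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: ids 1 and 3 contain exact words from the list,
# ids 2 and 4 only contain them as substrings ('Carry', 'sCar')
df = spark.createDataFrame(
    [(1, 'Used Cars for sale'),
     (2, 'Carry on luggage'),
     (3, 'Vehicle registration'),
     (4, 'my sCar')],
    ['id', 'title'])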

This is not a neat way to write it, so I tried a regular expression:

df2 = df.where('title rlike "\bCars?\b|\bVehicles?\b"').select('id','title')

I only need to match the exact word, e.g. 'Car' but not 'sCar' or 'Carry'. However, df2 comes back empty.

I also tried the approaches from How to efficiently check if a list of words is contained in a Spark Dataframe?, but they still match extra strings like 'sCar' or 'Carry'. Any suggestions?

newleaf

1 Answer


Use where to filter the df. To do that, join the search words with |:

# Build the regex pattern (Cars)|(Car)|(Vehicle)|(Vehicles) from the word list
l = ['Cars', 'Car', 'Vehicle', 'Vehicles']
s = '|'.join(["(" + c + ")" for c in l])
df.where(df['title'].rlike(s)).show()
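
Note that this pattern still matches substrings such as 'sCar', and the \b attempt in the question most likely failed because Python treats \b in a non-raw string as a backspace character. A minimal sketch of an exact-word variant, assuming the same word list and using raw strings so the word boundaries survive escaping:

from pyspark.sql import functions as F

# \b anchors the alternation at word boundaries, so 'sCar' and 'Carry' no longer match
words = ['Cars', 'Car', 'Vehicle', 'Vehicles']
pattern = r'\b(' + '|'.join(words) + r')\b'
df.where(F.col('title').rlike(pattern)).select('id', 'title').show()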
wwnde