I have a PySpark DataFrame with a string column 'title'. I need to filter it down to the rows whose title contains any word from the list ['Cars', 'Car', 'Vehicle', 'Vehicles'], matched as a whole word.
One way to do this:
filter_1 = "title like '%{}' or title like '%{}' or title like '%{}' or title like '%{}'"\
.format('Car','Cars','Vehicle','Vehicles')
df1 = df.filter(filter_1).select('id','title')
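Besides being verbose, this approach is also wrong for my purposes: a plain-Python sanity check (no Spark needed) shows that 'like %Car' is just a suffix match, so it accepts strings like 'sCar' too:

```python
# SQL "like '%Car'" means "ends with 'Car'", which str.endswith mimics
suffix = "Car"

print("my Car".endswith(suffix))  # the match I want
print("sCar".endswith(suffix))    # an unwanted match: 'sCar' also ends with 'Car'
```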
This is not a neat way to write it, so I tried a regular expression instead:
df2 = df.where('title rlike "\bCars?\b|\bVehicles?\b"').select('id','title')
I only want exact, whole-word matches: 'Car' should match, but 'sCar' and 'Carry' should not. However, df2 comes back empty.
I also tried the approach from "How to efficiently check if a list of words is contained in a Spark Dataframe?", but it still matches extra strings like 'sCar' and 'Carry'. Any suggestions?
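As a sanity check outside Spark, the word-boundary regex itself does exactly what I want when run through Python's re module, so I suspect the problem is how the pattern string reaches Spark (note that in a plain Python string literal, '\b' is the backspace character, not a regex word boundary, so a raw string is used below):

```python
import re

# Raw string keeps \b as a regex word boundary instead of a backspace char
pattern = re.compile(r"\bCars?\b|\bVehicles?\b")

print(bool(pattern.search("I bought a Car yesterday")))  # whole word -> match
print(bool(pattern.search("two Vehicles parked")))       # whole word -> match
print(bool(pattern.search("sCar on the road")))          # not a whole word -> no match
print(bool(pattern.search("please Carry this")))         # 'Carry' is not 'Car' -> no match
```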