
I have a PySpark dataframe with a string column title. I need to find all the rows whose title contains any of the words ['Cars','Car','Vehicle','Vehicles'], matched as whole words. One way to do this is:

# Chain one LIKE clause per search word
filter_1 = "title like '%{}%' or title like '%{}%' or title like '%{}%' or title like '%{}%'"\
    .format('Car', 'Cars', 'Vehicle', 'Vehicles')

df1 = df.filter(filter_1).select('id', 'title')
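
For reference, here is a minimal example of the kind of data involved (the sample rows below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: ids 1 and 3 contain exact words from the list,
# ids 2 and 4 only contain them as substrings ('Carry', 'sCar')
df = spark.createDataFrame(
    [(1, 'Used Cars for sale'),
     (2, 'Carry on luggage'),
     (3, 'Vehicle registration'),
     (4, 'my sCar')],
    ['id', 'title'])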

This is not a neat way to write it, so I tried a regular expression:

df2 = df.where('title rlike "\bCars?\b|\bVehicles?\b"').select('id','title')

I only need to match the exact word, e.g. 'Car' but not 'sCar' or 'Carry'. However, df2 comes back empty.

I also tried the approaches from How to efficiently check if a list of words is contained in a Spark Dataframe?, but they still match extra strings like 'sCar' or 'Carry'. Any suggestions?

newleaf

1 Answer


Use where to filter the df. To do that, join the search words with |:

# Build the regex pattern (Cars)|(Car)|(Vehicle)|(Vehicles) from the word list
l = ['Cars', 'Car', 'Vehicle', 'Vehicles']
s = '|'.join(["(" + c + ")" for c in l])
df.where(df['title'].rlike(s)).show()
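
Note that this pattern still matches substrings such as 'sCar', and the \b attempt in the question most likely failed because Python treats \b in a non-raw string as a backspace character. A minimal sketch of an exact-word variant, assuming the same word list and using raw strings so the word boundaries survive escaping:

from pyspark.sql import functions as F

# \b anchors the alternation at word boundaries, so 'sCar' and 'Carry' no longer match
words = ['Cars', 'Car', 'Vehicle', 'Vehicles']
pattern = r'\b(' + '|'.join(words) + r')\b'
df.where(F.col('title').rlike(pattern)).select('id', 'title').show()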
wwnde