Using PySpark DataFrames, I'm trying to do the following as efficiently as possible: I have a dataframe with a column that contains text, and a list of words I want to filter the rows by.
The dataframe would look like this:

df:
col1  col2  col_with_text
a     b     foo is tasty
12    34    blah blahhh
yeh   0     bar of yums
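For a reproducible example (assuming Spark 2.x with a SparkSession named spark), the frame could be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('a', 'b', 'foo is tasty'),
     ('12', '34', 'blah blahhh'),
     ('yeh', '0', 'bar of yums')],
    ['col1', 'col2', 'col_with_text'])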
The list of keywords will be words = ['foo', 'bar'], and thus the result should keep only the rows whose text contains one of them:

result:
col1  col2  col_with_text
a     b     foo is tasty
yeh   0     bar of yums
Afterwards I don't only want to do exact string matching, but also test the words for similarity, e.g. with difflib's SequenceMatcher.
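The similarity check I have in mind is roughly this (plain Python, not wired into Spark yet; the 0.8 threshold is just a placeholder value):

from difflib import SequenceMatcher

def is_similar(word, keyword, threshold=0.8):
    # ratio() is 1.0 for identical strings and drops towards 0.0 as they diverge
    return SequenceMatcher(None, word, keyword).ratio() >= threshold

is_similar('foo', 'fooo')  # True  (ratio is about 0.86)
is_similar('blah', 'bar')  # False (ratio is about 0.57)

This is what I already tried:
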
def check_keywords(row):
    words_list = ['foo', 'bar']
    # split the text column into words and keep the row on a keyword hit
    for word in row.col_with_text.split():
        if word in words_list:
            return row

# DataFrames have no .map in Spark 2.x, so it has to go through the RDD
result = df.rdd.map(check_keywords).collect()
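
For the exact-match part I also wonder whether staying in the DataFrame API would be more efficient than going through the RDD. This sketch is roughly what I have in mind (untested, and it assumes the keywords contain no regex metacharacters):

from functools import reduce
from pyspark.sql import functions as F

words = ['foo', 'bar']

# build one boolean condition per keyword and OR them together;
# the \b word boundaries stop 'bar' from matching inside e.g. 'barrel'
cond = reduce(lambda a, b: a | b,
              [F.col('col_with_text').rlike(r'\b' + w + r'\b') for w in words])
result = df.filter(cond)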
Unfortunately I have been unsuccessful so far; could someone help me out? Thanks in advance.