I have a dataset with around 400k rows. I need to find the common words between the question1 and question2 columns. I am able to print the output with a zip and for loop, but I would like to create a function that returns these values. Can you please help me?
for a, b in zip(df.question1, df.question2):
    # Unique lowercase words in each question
    str1 = set(a.lower().strip().split())
    str2 = set(b.lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    word_share = round(word_common / word_total, 2)
    print(word_common, word_total, word_share)
This prints the output:
10 23 0.43
4 20 0.2
4 24 0.17
However, when I wrap this inside a function, I get only one value (i.e. word_common), depending on where I place the return keyword. How can I store this output in a dataframe?
def find_common_words(df, strg1, strg2):
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        return word_common
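For reference, here is a minimal sketch of the kind of function I am after, assuming the goal is one output row per input row. It collects the three values in a list instead of returning from inside the loop, then builds the dataframe once at the end. The output column names, the two-row sample data, and the choice to report 0.0 when both questions are empty are my own assumptions, not requirements of the dataset.

```python
import pandas as pd


def find_common_words(df, strg1, strg2):
    """Return one row of word-overlap statistics per row of df."""
    rows = []
    for a, b in zip(df[strg1], df[strg2]):
        # Unique lowercase words in each question
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        # Assumption: use 0.0 when both questions are empty, which
        # would otherwise raise ZeroDivisionError.
        word_share = round(word_common / word_total, 2) if word_total else 0.0
        rows.append((word_common, word_total, word_share))
    # Reuse the input index so the result lines up with df row by row.
    return pd.DataFrame(
        rows,
        columns=["word_common", "word_total", "word_share"],
        index=df.index,
    )


# Hypothetical two-row sample standing in for the real 400k-row data.
sample = pd.DataFrame({
    "question1": ["What is the capital of France", "How do I learn Python"],
    "question2": ["What is the capital city of France", "How can I learn Python quickly"],
})
result = find_common_words(sample, "question1", "question2")
print(result)
```

Because the result shares the input index, something like sample.join(result) should attach the three columns to the original rows. Note that this sketch still assumes every cell is a string; missing values would need to be handled before calling .lower().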