
I have a dataset which has around 400k rows. I need to find common words between question1 and question2 columns. I am able to print the output with a zip and for loop, however I would like to create a function to return these values. Can you please help me?

for a, b in zip(df.question1, df.question2):
    str1 = (set(a.lower().strip().split()))
    str2 = (set(b.lower().strip().split()))
    word_common =  (len(str1 & str2))
    word_total = len(str1) + len(str2)
    word_share = round(word_common/word_total,2)
    print(word_common,word_total,word_share)

This prints the output:

10 23 0.43
4 20 0.2
4 24 0.17

However, when I wrap this in a function, I get only one value (i.e. word_common), depending on where I place the return keyword. How can I store this output in a DataFrame?

def find_common_words(df,strg1,strg2):
    for a, b in zip(df[strg1], df[strg2]):
        str1 = (set(a.lower().strip().split()))
        str2 = (set(b.lower().strip().split()))
        word_common =  (len(str1 & str2))
        word_total = len(str1) + len(str2)
        word_share = round(word_common/word_total,2)
        return word_common
Ji Wei
  • 1
    You can use [apply](https://stackoverflow.com/questions/33506826/call-function-use-apply-in-python) with axis=1 to parse every row and call a function inside where you perform your desired operation. – Raghul Raj Mar 19 '20 at 15:18
  • 2
    append `word_common` to an empty list and return the list outside the for-loop – It_is_Chris Mar 19 '20 at 15:20

2 Answers

1

When a return statement runs, the function stops and the value is handed back to the caller. Because your return sits inside the loop, the function exits during the first iteration and only the first value of word_common is returned. You should instead store your values in a list and return it after the loop.
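A minimal sketch of the difference, using two hypothetical helper functions (`first_only` and `all_values` are made-up names for illustration, not part of the question's code):

```python
def first_only(values):
    # `return` inside the loop exits the function on the first iteration
    for v in values:
        return v * 2


def all_values(values):
    results = []
    for v in values:
        results.append(v * 2)
    # `return` after the loop hands back every collected value
    return results


print(first_only([1, 2, 3]))  # 2
print(all_values([1, 2, 3]))  # [2, 4, 6]
```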

Secondly, since you have a DataFrame, you can use its apply method with axis=1 to produce your values. It takes a function as input and calls it on each row of the DataFrame.
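To illustrate what the axis argument changes, here is a small sketch on a toy DataFrame (the column names `x` and `y` and the data are invented for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_sums = df_demo.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_sums = df_demo.apply(lambda row: row["x"] + row["y"], axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```

With axis=1 the function sees one row at a time, which is what a per-row comparison of two columns needs.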

In the following code, the value of word_common is stored in a new column of your DataFrame, named word_common:

def parse_one_row(row):
    a = row['question1']
    b = row['question2'] 
    str1 = (set(a.lower().strip().split()))
    str2 = (set(b.lower().strip().split()))
    word_common =  (len(str1 & str2))
    word_total = len(str1) + len(str2)
    word_share = round(word_common/word_total,2)
    return (word_common, word_total, word_share)


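# The first apply returns a Series of tuples; Series.apply has no axis argument,
# so only the row-wise DataFrame.apply gets axis=1
df['word_common'] = df.apply(parse_one_row, axis=1).apply(lambda x: x[0])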

See the official pandas documentation for DataFrame.apply.

Catalina Chircu
  • Thank you. This works too. Why do I need to use axis=1? I did not understand. – aloneonthe_edge Mar 19 '20 at 15:35
  • You need `axis=1`, which means apply the function on each row, because axis is by default set to zero, and in this case it will apply on each column. Just a piece of advice: Use the apply function rather than for loops, because it is optimized and is faster. A for loop can be very long and take a lot of memory if you deal with huge datasets, in other words, a real catastrophe. – Catalina Chircu Mar 19 '20 at 15:39
  • Thank you Catalina. Makes sense. – aloneonthe_edge Mar 19 '20 at 15:41
  • Hello Catalina, in the above code, if I need to retrieve 'word_total', 'word_share', and 'word_common', how can I do that? Also, how can I add all of them to the dataframe? – aloneonthe_edge Mar 27 '20 at 03:28
  • In the function you return a tuple `(word_common, word_total, word_share)` and at a second stage you return the first element for `word common` with `x[0]` in the lambda function, `word total` as the second element with `x[1]` and so on. – Catalina Chircu Mar 27 '20 at 06:38
  • Is there no way to create a df inside the function and call that? – aloneonthe_edge Mar 27 '20 at 07:21
  • Creating the DataFrame inside the function means creating it each time which you do not want. What I suggest in my code is to create it outside the function and at each iteration add a new column (according to what you want to do). What you suggest is bad practice. – Catalina Chircu Mar 27 '20 at 09:10
  • Thank you Catalina. It was of immense help – aloneonthe_edge Mar 27 '20 at 09:36
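For the follow-up about getting all three values into the DataFrame at once, one possible approach (not the one described in the comments above) is to pass `result_type="expand"` to `DataFrame.apply`, which spreads each returned tuple across columns. The sketch below uses made-up toy data, and the guard for a zero word total is an added assumption that the original function does not have:

```python
import pandas as pd


def parse_one_row(row):
    str1 = set(row["question1"].lower().strip().split())
    str2 = set(row["question2"].lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    # Guard against two empty questions (assumption added for this sketch)
    word_share = round(word_common / word_total, 2) if word_total else 0.0
    return word_common, word_total, word_share


# Toy data, invented for illustration
df = pd.DataFrame({
    "question1": ["What is pandas", "How do I loop"],
    "question2": ["What is numpy", "How do I iterate in Python"],
})

cols = ["word_common", "word_total", "word_share"]
# result_type="expand" turns each returned tuple into one column per element
df[cols] = df.apply(parse_one_row, axis=1, result_type="expand")
print(df[cols])
```

The DataFrame is still created once, outside the function; the function only returns the three numbers for a single row.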
0

Use this to return the values in a dataframe:

import pandas as pd

def find_common_words(df, strg1, strg2):
    stats = []  # one [common, total, share] entry per row
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)       # words present in both questions
        word_total = len(str1) + len(str2)   # unique words in each question, summed
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    # return after the loop so every row is included
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])
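A usage sketch on toy data follows. The function is repeated so the snippet runs on its own, the sample questions are invented, and passing `index=df.index` is an addition (not in the answer's code) so that the result lines up with the original rows when joined:

```python
import pandas as pd


def find_common_words(df, strg1, strg2):
    stats = []
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    # index=df.index (added here) keeps the rows aligned with df for the join
    return pd.DataFrame(
        stats,
        columns=["Word Common", "Word Total", "Word Share"],
        index=df.index,
    )


# Toy data, invented for illustration
df = pd.DataFrame({
    "question1": ["Where is the cat", "Hello world"],
    "question2": ["Where is the dog now", "Goodbye world"],
})

stats = find_common_words(df, "question1", "question2")
result = df.join(stats)  # original columns plus the three new ones
print(result)
```

A plain zip loop like this still runs Python code once per row, much as apply with axis=1 does, so it is worth timing both on the real data before assuming either one is faster.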

Ji Wei