Python: determine if three text strings stored in a dataframe have any words in common

Question

Say I have the following dataframe df:

      A             B               C
0     mom;dad;son;  sister;son;     yes;no;maybe;
1     dad;          daughter;niece; no;snow;
2     son;dad;      cat;son;dad;    tree;dad;son;
3     daughter;mom; niece;          referee;
4     dad;daughter; cat;            dad;

And you want to check if, between columns A, B, and C, there is a common word, and create a column D with 1 if there is and 0 if there isn't any. For a word to be common, it's enough for it to appear in just two of the three columns.

The outcome should be:

      A             B               C              D
0     mom;dad;son;  sister;son;     yes;no;maybe;  1
1     dad;          daughter;niece; no;snow;       0
2     son;dad;      cat;son;dad;    tree;dad;son;  1
3     daughter;mom; niece;          referee;       0
4     dad;daughter; cat;            dad;           1

I am trying to implement this by doing:

for index, row in df.iterrows():

    w1=row['A'].split(';')
    w2=row['B'].split(';')
    w3=row['C'].split(';')

    if len(set(w1).intersection(w2))>0 or len(set(w1).intersection(w3))>0 or len(set(w2).intersection(w3))>0:
        df['D'][index]==1
    else:
        df['D'][index]==0

However, the resulting D column only bears 0 because (possibly) I am not comparing each individual word in w1 to the others in w2 and w3. How could I achieve this?

piRSquared · Accepted Answer · 2018-05-09T13:13:22.857

Use stack + pandas.Series.str.get_dummies

df.assign(
    D=df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int)
)

               A                B              C  D
0   mom;dad;son;      sister;son;  yes;no;maybe;  1
1           dad;  daughter;niece;       no;snow;  0
2       son;dad;     cat;son;dad;  tree;dad;son;  1
3  daughter;mom;           niece;       referee;  0
4  dad;daughter;             cat;           dad;  1

Details

Notice that when we stack and get dummies, the interim result looks like this:

     cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes
0 A    0    1         0      0    1      0   0        0       0     0    1     0    0
  B    0    0         0      0    0      0   0        0       1     0    1     0    0
  C    0    0         0      1    0      0   1        0       0     0    0     0    1
1 A    0    1         0      0    0      0   0        0       0     0    0     0    0
  B    0    0         1      0    0      1   0        0       0     0    0     0    0
  C    0    0         0      0    0      0   1        0       0     1    0     0    0
2 A    0    1         0      0    0      0   0        0       0     0    1     0    0
  B    1    1         0      0    0      0   0        0       0     0    1     0    0
  C    0    1         0      0    0      0   0        0       0     0    1     1    0
3 A    0    0         1      0    1      0   0        0       0     0    0     0    0
  B    0    0         0      0    0      1   0        0       0     0    0     0    0
  C    0    0         0      0    0      0   0        1       0     0    0     0    0
4 A    0    1         1      0    0      0   0        0       0     0    0     0    0
  B    1    0         0      0    0      0   0        0       0     0    0     0    0
  C    0    1         0      0    0      0   0        0       0     0    0     0    0

Where the prior columns are embedded in the second level of the index. So I want to sum over the first level in order to see how many times that word appears.

That summation interim looks like:

   cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes
0    0    1         0      1    1      0   1        0       1     0    2     0    1
1    0    1         1      0    0      1   1        0       0     1    0     0    0
2    1    3         0      0    0      0   0        0       0     0    3     1    0
3    0    0         1      0    1      1   0        1       0     0    0     0    0
4    1    2         1      0    0      0   0        0       0     0    0     0    0

Notice that we catch 'son' in row 1, 'dad' and 'son' in row 3 and so on.

If it appears in more than 1 column (hence gt(1)) then I want to count it as a 1 (hence any(1).astype(int)).

This is an exceptionally neat piece of code that might benefit from a touch more explanation. — asongtoruin, May 09 '18 at 13:07
How does the `D=df.stack()` part work if your column has a name such as `There are spaces in this header`? Also, how do you apply this to a selection of an existing dataframe, e.g. without creating a new one? — FaCoffee, May 09 '18 at 13:41
@FaCoffee you can do a couple of things. One, you can drop the results of `df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int)` directly into the dataframe with `df['Name with spaces'] = ...` or you can continue to leverage the `assign` method by unpacking a dictionary `df.assign(**{'Name with spaces': df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int)})` — piRSquared, May 09 '18 at 13:48
Can you have a list in place of `Name with spaces` if you have multiple columns to process that have spaces in their headers? — FaCoffee, May 09 '18 at 13:52
Ahh, I think you mean that your example dataframe with columns `['A', 'B', 'C']` is actually `['Col 1', 'Col 2', 'Col 3']`? If I'm right, then yes, your suggestion should work. `df[['Col 1', 'Col 2', 'Col 3']].stack()...` — piRSquared, May 09 '18 at 13:55

zipa · Answer 2 · 2018-05-09T13:15:43.060

4

This one-liner creates what you need, using collections.Counter:

from collections import Counter

df['D'] = df.applymap(lambda x: [i for i in x.split(';') if i]).apply(lambda x: int(Counter(x.A+x.B+x.C).most_common(1)[0][1]!=1), axis=1)

edited May 09 '18 at 13:15

answered May 09 '18 at 13:09

zipa

27,316
6
40
58

score 3 · Answer 3 · answered May 09 '18 at 13:16

3

You can use your code by fixing a typo: replace == with =.

answered May 09 '18 at 13:16

haklir

76
4

score 1 · Answer 4 · edited May 09 '18 at 13:51

1

Try this:

def find_common(row):
     A_list=set(row['A'].rsplit(';')[:-1])
     B_list=set(row['B'].rsplit(';')[:-1])
     C_list=set(row['C'].rsplit(';')[:-1])
     if ((len(A_list.intersection(B_list))) or (len(B_list.intersection(C_list))) or (len(A_list.intersection(C_list)))):
        return 1
     else: 
        return 0

df['D']=df.apply(find_common,axis=1)

edited May 09 '18 at 13:51

FaCoffee

7,609
28
99
174

answered May 09 '18 at 13:37

jazib jamil

542
4
13

Python: determine if three text strings stored in a dataframe have any words in common

4 Answers4

Details