Extract co-occurrence data from dataframe

Question

I have something like this:

 fromJobtitle         toJobtitle         size
0              CEO                CEO    65
1              CEO     Vice President    23
2              CEO           Employee    56
3   Vice President                CEO   112
4         Employee                CEO    20

I would like to count number of co-occurences so that it combines the double occurences (showing only how many elements there are between the 2)

An example Output:

0              CEO     Vice President   135
1              CEO           Employee    76
2              CEO                CEO    65

Amin Ba · Accepted Answer · 2021-06-03T14:45:57.480

import pandas as pd
df = pd.DataFrame({
    'fromJobtitle': ['CEO', 'CEO', 'CEO', 'Vice President', 'Employee'],
    'toJobtitle': ['CEO', 'Vice President', 'Employee', 'CEO', 'CEO'],
    'size': [65, 23, 56, 112, 20]
    })

df['combination'] = df.apply(lambda row: tuple(sorted([
                                                       row['fromJobtitle'], 
                                                       row['toJobtitle']
                                                      ])), axis=1)

then:

df = df.groupby('combination').sum().reset_index()

results:

    combination             size
0   (CEO, CEO)              65
1   (CEO, Employee)         76
2   (CEO, Vice President)   135

finally:

df['from'] = df.apply(lambda row: row['combination'][0], axis=1)
df['to'] = df.apply(lambda row: row['combination'][1], axis=1)
df = df.drop('combination', axis=1)
df.head()

result:

    size    from    to
0   65      CEO     CEO
1   76      CEO     Employee
2   135     CEO     Vice President

Also, how do you remove mappings to same elements ? (CEO to CEO etc) — darklight213, Jun 03 '21 at 16:59
@darklight213 I map the first element of tuple in the `cobination` column to the `to` column and the second one to the `from` column and then drop the `combination` column — Amin Ba, Jun 03 '21 at 17:02

score 2 · Answer 2 · answered Jun 03 '21 at 13:45

2

Try:

df.groupby(lambda x: tuple(sorted(df.loc[x, ['fromJobTitle', 'toJobTitle']]))).sum()

Here is the result:

                       size
(CEO, CEO)               65
(CEO, Employee)          76
(CEO, Vice President)   135

answered Jun 03 '21 at 13:45

Riccardo Bucco

13,980
4
22
50

score 2 · Answer 3 · answered Jun 03 '21 at 13:45

Here's a different solution:

First create a column which combines the names in alphabetical order

df['titles'] = np.where(df['fromJobtitle']<df['toJobtitle'],df['fromJobtitle']+"|"+df['toJobtitle'],df['toJobtitle']+"|"+df['fromJobtitle'])

0               CEO|CEO
1    CEO|Vice President
2          CEO|Employee
3    CEO|Vice President
4          CEO|Employee
Name: titles, dtype: object

Then group by that name and sum it

df_groups = df.groupby('titles').sum().reset_index()
df_groups

titles              size
CEO|CEO             65
CEO|Employee        76
CEO|Vice President  135

Then just split the combined name into individual pieces

df_groups[['fromJobTitle', 'toJobTitle']] = df_groups['titles'].str.split('|', expand=True)
df_groups

size fromJobTitle   toJobTitle
65    CEO           CEO
76    CEO           Employee
135   CEO           Vice President

Extract co-occurrence data from dataframe

3 Answers3