1

I have the below panda dataframe

enter image description here

Now I want to sum the count for [user 1][user 2] and [user 2][user 1]. For example, the count value for [user 1 = 1][user 2 = 92] is count = 1 and [user 1 = 92][user 2 = 1] is count = 2, so I want the final output as user 1 = 1, user 2 = 92 and count = 3. Also, there might not be [user 2][user 1] for every [user 1][user 2]. For example, for [user 1 = 29][user 2 = 56] count value is 3 but there is no [user 1 = 56][user 2 = 29], so the final output would be user 1 = 29, user 2 = 56 and count = 3.

So final output should look like this

enter image description here

Can someone please guide me how to do this?

aries
  • 849
  • 3
  • 11
  • 24

1 Answers1

1

A solution grouping sorted tuples:

df['users'] = list(zip(df.user1,df.user2))
df['users'] = df.users.apply(lambda t:tuple(sorted(t)))
counts = df.groupby('users', as_index=False)['count'].sum()
counts[['user1','user2']] = counts.users.apply(pd.Series)
counts = counts[['user1','user2','count']] # Select and reorder columns
Adrien Matissart
  • 1,610
  • 15
  • 19
  • Thanks, it worked great. Can you please explain the code a bit? – aries Jul 08 '17 at 10:04
  • 1
    Sure. First we build a column `users` that contains unique (sorted) pairs of users. A tuple is required (instead of a list) because we need hashable values to use `groupby`. Then we group our tuples and sum the `count` column. `as_index=False` is useful here to keep the tuples in a separate column and process it later. Finally we apply `pd.Series` to split tuples into 2 separate columns again. – Adrien Matissart Jul 08 '17 at 10:11