I have a dataframe of users and their friends that looks like:

user_id | friend_id
1         3
1         4
2         3
2         5
3         4

I want to write a function in Python to compute the number of mutual friends for each pair:

user_id | friend_id | num_mutual
1         3           1
1         4           1
2         3           0
2         5           0
3         4           1

Currently I have:

def find_mutual(df):
    num_mutual = []
    for i in range(len(df)):
        user, friend = df.loc[i, 'user_id'], df.loc[i, 'friend_id']
        user_list = df[df.user_id == user].friend_id.tolist() + df[df.friend_id == user].user_id.tolist()
        friend_list = df[df.user_id == friend].friend_id.tolist() + df[df.friend_id == friend].user_id.tolist()
        mutual = len(list(set(user_list) & set(friend_list)))
        num_mutual.append(mutual)
    return num_mutual

It works fine for small datasets, but I'm running it on a dataset with millions of rows. It takes forever to run everything. I know it's not the ideal way to find the count. Is there a better algorithm in Python? Thanks in advance!

CWuu
  • For `n` friends, you're effectively creating an `n^2` table. Very expensive computationally, irrespective of the algorithm – Abhinav Mathur Oct 17 '20 at 17:54
  • I think you really have two different questions. First, is there a better algorithm for this problem that requires less memory than the n^2 table and runs in something closer to O(n) time? Second, is there a Python library that can be used to implement this algorithm? While I don't have a ready answer to either question, you might think about using dynamic programming techniques to break the problem down into smaller pieces. – itprorh66 Oct 17 '20 at 18:13
  • Further thoughts. You might consider your data frame as a list of graph edges and then look at solving the problem as a [Disjoint Set](https://www.geeksforgeeks.org/union-find/) problem – itprorh66 Oct 17 '20 at 18:28
  • Thank you for your comments and suggestions! – CWuu Oct 18 '20 at 00:54

2 Answers

The [ugly] idea is to construct a 4-point path that starts and ends at the same user_id (user, friend, third person, back to user). If such a path exists, the first two points on it have a mutual friend: the third point.

We start with:

df
          user_id  friend_id
0        1          3
1        1          4
2        2          3
3        2          5
4        3          4

Then you can do:

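import pandas as pd

# Make the edge list symmetric so each friendship appears in both directions.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
dff = pd.concat([df, df.rename(columns={"user_id":"friend_id","friend_id":"user_id"})], ignore_index=True)
# Pair up users who share a friend, then keep only pairs that are friends themselves.
df_new = dff.merge(dff, on="friend_id", how="outer")
df_new = df_new[df_new["user_id_x"]!= df_new["user_id_y"]]
df_new = df_new.merge(dff, left_on= "user_id_y", right_on="user_id")
df_new = df_new[df_new["user_id_x"]==df_new["friend_id_y"]]
# The left join yields one row per mutual friend (NaN where there is none),
# so sum per pair to get a count rather than duplicated rows.
df_out = df.merge(df_new, left_on=["user_id","friend_id"], right_on=["user_id_x","friend_id_x"], how="left",suffixes=("__","_"))
df_out["count"] = (~df_out["user_id_x"].isnull()).astype(int)
df_out.groupby(["user_id__","friend_id"], as_index=False)["count"].sum()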

   user_id__  friend_id  count
0          1          3      1
1          1          4      1
2          2          3      0
3          2          5      0
4          3          4      1
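
A shorter variant of the same merge idea is sketched below. It is not the code above, just one possible way to count directly: join the symmetric edge list to itself on the shared friend, count rows per pair, and attach the counts to the original edges. The names `both`, `pairs`, `counts` and `result` are made up for illustration, and the self-join can still get large when some users have very many friends.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "friend_id": [3, 4, 3, 5, 4]})

# Symmetric edge list: every friendship in both directions.
both = pd.concat(
    [df, df.rename(columns={"user_id": "friend_id", "friend_id": "user_id"})],
    ignore_index=True,
).drop_duplicates()

# Self-join on the shared friend: each row is (user_id, friend_id, user_id_other),
# meaning friend_id is a friend of both user_id and user_id_other.
pairs = both.merge(both, on="friend_id", suffixes=("", "_other"))

# Number of shared friends for every pair that has at least one.
counts = (
    pairs.groupby(["user_id", "user_id_other"])
    .size()
    .reset_index(name="num_mutual")
    .rename(columns={"user_id_other": "friend_id"})
)

# Attach the counts to the original edges; pairs with no match get 0.
result = df.merge(counts, on=["user_id", "friend_id"], how="left")
result["num_mutual"] = result["num_mutual"].fillna(0).astype(int)
print(result)
```

Because the join is done once for the whole frame, there is no Python-level loop over rows, which is usually where the original function loses most of its time.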

A more elegant and straightforward way is to use a graph approach:

import networkx as nx
g = nx.from_pandas_edgelist(df, "user_id","friend_id")
nx.draw_networkx(g)

The number of mutual friends of two adjacent nodes (two friends) is then the number of 3-node paths that connect them:

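from networkx.algorithms.simple_paths import all_simple_paths
for row in df.itertuples():
    # cutoff=2 limits the search to paths of at most 2 edges (3 nodes);
    # without it, every simple path between the two nodes is enumerated.
    paths = all_simple_paths(g, row.user_id, row.friend_id, cutoff=2)
    df.at[row.Index, "count"] = sum(len(p) == 3 for p in paths)
print(df)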
   user_id  friend_id  count
0        1          3    1.0
1        1          4    1.0
2        2          3    0.0
3        2          5    0.0
4        3          4    1.0
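
A 3-node path between two friends passes through exactly one shared neighbor, so the same number can also be read off as the count of common neighbors. The sketch below assumes networkx's `common_neighbors` function and rebuilds the example frame so it runs on its own; it is an alternative to the path search above, not a replacement the answer itself proposes.

```python
import networkx as nx
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "friend_id": [3, 4, 3, 5, 4]})
g = nx.from_pandas_edgelist(df, "user_id", "friend_id")

# Count the nodes adjacent to both endpoints of each edge.
df["count"] = [
    sum(1 for _ in nx.common_neighbors(g, u, v))
    for u, v in zip(df["user_id"], df["friend_id"])
]
print(df)
```

This avoids enumerating paths altogether: each lookup only touches the neighbor sets of the two nodes involved.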
Sergey Bushmanov

First create an adjacency list, in the form of a dictionary, to hold the data:

db = dict()  # adjacency list
num = int(input("Enter number of friends = "))
for i in range(num):
    friend = input("Enter name = ")
    db[friend] = input("Enter his/her friends name separated by space = ").split()

To find the number of mutual friends between two people, compare their lists of friends and count the friends they have in common. Here is an example of how you could do this:

def num_mutual_friends(friend1, friend2):
    # friend1 and friend2 are the two people's lists of friends
    set1 = set(friend1)
    set2 = set(friend2)
    mutual_friends = set1 & set2  # intersection (common friends)
    return len(mutual_friends)

Test the function as follows:

friend1, friend2 = input("Enter two names separated by space = ").split()
if friend1 in db and friend2 in db:
    print("Number of mutual friends = ",
          num_mutual_friends(db[friend1], db[friend2]))
else:
    print("Person not found")
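
Applied to the edge list from the question instead of typed input, the same adjacency-list idea might look like the sketch below. The function name `count_mutual` and the tuple-based edge list are assumptions made for illustration; each friendship is stored in both directions so that a person's friend set is complete regardless of which column they appear in.

```python
from collections import defaultdict


def count_mutual(edges):
    """Return the number of mutual friends for each (user, friend) pair."""
    # Build the adjacency list once: person -> set of friends.
    friends = defaultdict(set)
    for user, friend in edges:
        friends[user].add(friend)
        friends[friend].add(user)
    # Mutual friends of a pair are the intersection of their friend sets.
    return [len(friends[user] & friends[friend]) for user, friend in edges]


# Edge list from the question as (user_id, friend_id) tuples.
edges = [(1, 3), (1, 4), (2, 3), (2, 5), (3, 4)]
print(count_mutual(edges))  # [1, 1, 0, 0, 1]
```

With a DataFrame, the edge list can be obtained with `list(zip(df["user_id"], df["friend_id"]))`. The adjacency list is built in one pass over the edges, so each row afterwards costs only one set intersection instead of four scans of the whole frame.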