
I am using the Jaccard coefficient to predict links in a network and then compute the AUC score of my prediction. My code works, but it gives me a different score each time because it randomly removes different edges to form the test set. Say I want to run the prediction 1000 times, store the scores, and then take their average. What would I need to add to or change in my code?

INPUT


import random
import networkx as nx
from sklearn import metrics
from sklearn.metrics import roc_auc_score

# Remove 20% of the edges to use as the test set
proportion_edges = .2
edge_subset = random.sample(list(G.edges()), int(proportion_edges*G.number_of_edges()))

# Create a copy of the graph and remove the test edges
G_train = G.copy()
G_train.remove_edges_from(edge_subset)

# Make predictions with the Jaccard coefficient
# (with no ebunch argument, this scores all non-edges of G_train)
pred_jaccard = list(nx.jaccard_coefficient(G_train))
score_jaccard, label_jaccard = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_jaccard])

# Compute the ROC AUC score for the Jaccard coefficient
fpr_jaccard, tpr_jaccard, _ = metrics.roc_curve(label_jaccard, score_jaccard)
auc_jaccard = roc_auc_score(label_jaccard, score_jaccard)
auc_jaccard

OUTPUT

0.6926406926406927

1 Answer


To answer your question directly: you need to wrap your code in a loop:

# Settings
import random
import numpy as np
import networkx as nx
from sklearn.metrics import roc_auc_score

proportion_edges = .2
auc_jaccard_list = []

for i in range(1000):
    # Remove 20% of the edges as the test set
    edge_subset = random.sample(list(G.edges()), int(proportion_edges*G.number_of_edges()))
    G_train = G.copy()
    G_train.remove_edges_from(edge_subset)

    # Score all candidate pairs, labeling the removed edges as class 1
    pred_jaccard = list(nx.jaccard_coefficient(G_train))
    score_jaccard, label_jaccard = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_jaccard])
    auc_jaccard = roc_auc_score(label_jaccard, score_jaccard)
    auc_jaccard_list.append(auc_jaccard)

# print the average over all runs
print(np.mean(auc_jaccard_list))
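
If you also want a sense of the spread across the 1000 runs, numpy's standard deviation can be reported the same way:

# spread of the AUC scores across runs
print(np.std(auc_jaccard_list))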

Methodological side

From the methodological side, I would suggest revising a few details:

Definition of class 1 edges

You evaluate over every scored node pair:

score_jaccard, label_jaccard = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_jaccard])

But only the test edges count as class 1, so every other scored pair is labeled as class 0. (With no ebunch argument, nx.jaccard_coefficient scores all non-edges of G_train, i.e. the removed test edges plus all pairs that were never connected.)
Doing so evaluates how well your method predicts whether a pair is part of the randomly chosen edge set.

Suggestion: Create a test set that consists of randomly chosen pairs of nodes, sampled independently of whether an edge exists or not, and evaluate only over these pairs. That will probably increase your AUC.
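
As a rough sketch of that suggestion (the variable names follow the code above; sampling negatives via nx.non_edges is one possible choice, and materializing all non-edges can be expensive on large graphs):

import random
import networkx as nx
from sklearn.metrics import roc_auc_score

# Positives: the removed test edges.
# Negatives: an equally sized random sample of node pairs that are
# not connected in the original graph G.
pos_pairs = list(edge_subset)
neg_pairs = random.sample(list(nx.non_edges(G)), len(pos_pairs))

# Score only the chosen test pairs by passing them as ebunch
test_pairs = pos_pairs + neg_pairs
pos_set = set(pos_pairs)
pred = nx.jaccard_coefficient(G_train, test_pairs)
score, label = zip(*[(s, (u, v) in pos_set) for (u, v, s) in pred])

auc = roc_auc_score(label, score)

Using an equal number of positives and negatives keeps the test set balanced, instead of evaluating against every non-existing edge in the graph.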

Mixing training and testing

Removing edges for testing also modifies the training graph, which changes the Jaccard coefficients of both the training and the test pairs.

Suggestion: Unfortunately, it is difficult to come up with a good approach without knowing more about your use case.

Broele
  • I do not fully follow the methodological component. What would your suggestion "create a test set that consists of randomly chosen pairs of nodes, independently of whether there is an edge or not" look like in code? – Oscar Fernando CV Sep 11 '22 at 20:40
  • And what other information would you need to think of a better approach? I really appreciate your advice. – Oscar Fernando CV Sep 11 '22 at 20:41
  • If you look at a graph, each pair of nodes is either connected by an edge (existing edge) or not connected (non-existing edge). Currently, you choose 20% of the existing edges as the test set and label them as class 1, while all scored non-existing edges are labeled as class 0. So you evaluate over all of these pairs, but label them in a strange way. Instead, you could select 20% of the existing edges and a comparable number of non-existing edges to create a test set. – Broele Sep 11 '22 at 23:08
  • Overall, the setting is a bit uncommon, since you do not really learn a model but plainly compute statistics. The typical setting of totally separated train/test datasets does not seem to work here, since both have to operate on the same graph. So I am not even sure if train/test splits are the right way here. For a better answer, I would need to understand how this system would be used. What edges do you want to predict? Is it a kind of recommender system? Or are there new nodes added to the graph? – Broele Sep 11 '22 at 23:16
  • By the way: you might consider asking such questions in the [data science section](https://datascience.stackexchange.com/) – Broele Sep 11 '22 at 23:19