
I'm trying to understand how to use StellarGraph's EdgeSplitter class. In particular, the examples in the documentation for training a link prediction model based on Node2Vec split the graph into the following parts:

(figure: distribution of samples across the train, validation and test sets)

Following the examples in the documentation, you first sample 10% of the links of the full graph to obtain the test set:

# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)

# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from graph, and obtain the
# reduced graph graph_test with the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)

As far as I understand from the docs, graph_test is the original graph but with the test links removed. Then you perform the same operation to obtain the training set:

# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)

Following the previous logic, graph_train corresponds to graph_test with the training links removed.
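To make the relative sizes concrete, here is a quick arithmetic sketch of how the two nested splits divide the edges. The edge count is a made-up illustrative number; only the proportions from p=0.1 at each stage matter:

```python
# Illustrative edge counts for the two nested EdgeSplitter calls (p=0.1 each).
# The starting count is hypothetical; only the proportions matter.
total_edges = 10_000

# First split: 10% of all edges become positive test examples.
test_pos = int(0.1 * total_edges)              # 1000
remaining_after_test = total_edges - test_pos  # 9000 edges left in graph_test

# Second split: 10% of the *remaining* edges become positive train examples.
train_pos = int(0.1 * remaining_after_test)              # 900
edges_in_graph_train = remaining_after_test - train_pos  # 8100

print(test_pos, train_pos, edges_in_graph_train)  # 1000 900 8100
```

So with these defaults the classifier's test set (1000 positives) is actually larger than its training set (900 positives), which is exactly what my second question is about. Each split also draws an equal number of negative (non-edge) samples, so examples/labels contain twice these counts.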

Further down the code, my understanding is that we use graph_train to train the embedding and the training samples (examples, labels) to train the classifier. So I have several questions here:

  • Why are we using disjoint sets of training data to train different parts of the model? Shouldn't we train both the embedding and the classifier with the full training set of links?
  • Why is the test set so big? Wouldn't it be better to have most samples in the training set?
  • What is the correct way of using the EdgeSplitter class?

Thank you in advance for your help!

1 Answer


Why disjoint sets: This may or may not matter, depending on the embedding algorithm. The risk with edges that are seen both by the embedding algorithm and by the classifier as targets is that the embedding may encode non-generalizable features.

For example, in theory one feature of the embedding could be the node id, and other features could encode the entire neighborhood of the node. By combining two nodes' embeddings into a link vector in a sufficiently expressive way, or by using a multilayer model, one could then construct a binary feature that is 1 if the two nodes were connected during embedding training and 0 otherwise. The classifier might then simply learn to use this trivial feature, which is absent (i.e. always has value 0) when you move to the test data.

The above would not happen in a realistic scenario, but more subtle features could have the same effect to a lesser degree. In the end, this only risks making model selection worse: the first split is there to make the test score reliable, and the second split is there to improve model selection. You can therefore omit the second split if you wish.
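The leakage argument above can be illustrated with a toy numeric sketch. The "leak feature" below is hypothetical, standing in for any embedding-derived signal that merely memorizes which node pairs were connected during embedding training:

```python
# Toy illustration of the leakage described above (hypothetical feature values).
# Suppose the embedding leaks a binary "were these two nodes connected while
# the embedding was trained?" feature into each candidate link's vector.
train = [
    # (leak_feature, label): positive examples were edges in the embedding's graph...
    (1, 1), (1, 1), (1, 1),
    # ...negative examples were not.
    (0, 0), (0, 0), (0, 0),
]
# A classifier can reach perfect train accuracy by thresholding the leak feature.
classify = lambda leak: 1 if leak == 1 else 0
train_acc = sum(classify(f) == y for f, y in train) / len(train)

# On the test set the positive edges were *removed* before the embedding was
# trained, so the leak feature is 0 everywhere and the rule predicts all zeros.
test = [(0, 1), (0, 1), (0, 0), (0, 0)]
test_acc = sum(classify(f) == y for f, y in test) / len(test)

print(train_acc, test_acc)  # 1.0 0.5 -- the feature does not generalize
```

Holding out the classifier's training edges from the embedding (the second split) removes this failure mode, at the cost of a slightly smaller embedding graph.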

Why test set so big: You are likely to get a higher score with a bigger training set. As long as the experiment is repeated with different splits and the variance is under control, it should be fine to increase the training size.
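Checking that the variance is under control can be as simple as repeating the pipeline with different random splits and summarizing the scores. The AUC values below are placeholders, not real results:

```python
import statistics

# Hypothetical AUC scores from repeating the whole experiment with different
# random splits (e.g. re-running EdgeSplitter.train_test_split with different
# seeds). These numbers are placeholders, not real measurements.
scores = [0.87, 0.85, 0.88, 0.86, 0.87]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"AUC = {mean:.3f} +/- {spread:.3f}")
```

If the spread is small relative to the differences you care about, lowering p in the first split (i.e. a smaller test set and a larger training graph) is a reasonable choice.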

What is the correct way to use EdgeSplitter: I don't know what 'correct' means here. I think graph splitting is still an active research field.