I'm trying to understand how to use Stellargraph's EdgeSplitter
class. In particular, the examples on the documentation for training a link prediction model based on Node2Vec splits the graph in the following parts:
Distrution of samples across train, val and test set
Following the examples on the documentation, first you sample 10% of the links of the full graph in order to obtain the test set:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from graph, and obtain the
# reduced graph graph_test with the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
p=0.1, method="global"
)
As far as I understand from the docs, graph_test
is the original graph but with the test links removed. Then you perform the same operation with the training set,
# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
p=0.1, method="global"
)
Following the previous logic, graph_train
corresponds to graph_test
with the training links removed.
Further down the code, my understanding is that we use graph_train
to train the embedding and the training samples (examples, labels) to train the classifier. So I have several questions here:
- Why are we using disjoint sets of training data to train different parts of the model? Shouldn´t we train both the embedding and the classifier with the full training set of links?
- Why is the test set so big? Wouldn´t it be better to have most samples in the training set?
- What is the correct way of using the
EdgeSplitter
class?
Thanks you in advance for your help!