
I have my own dataset with two classes, say 0 and 1. A large share of the nodes are unlabeled, and my goal is to predict their classes using a GCN. I am confused about how to handle these unlabeled nodes in PyTorch Geometric.

As far as I can tell, I could label the nodes with three classes: 0, 1, and unknown. But would that mean I am training a three-class classifier? That is not what I want, since unknown is not a class.

Another option is to ignore the unlabeled nodes and run PyG only on the labeled ones. But then the unlabeled nodes, which do have features, seem to contribute nothing to the dataset.

Sparky05

1 Answer


That very much depends on your use case and the data!

Case 1 - Graph Autoencoder

For this case let's assume the task is to find similar tweets. A way of doing this is to train a Graph Autoencoder (see example). This approach is completely unsupervised and thus does not need any data to be labeled.

The resulting model should generate an embedding for each node (here, each tweet) such that similar tweets are closer together than dissimilar ones (measured, e.g., by cosine distance).

Case 2 - Semi-Supervised GCN

Another case would be classifying tweets as advertisement vs. non-advertisement. Since GCNs are typically trained in a semi-supervised manner, having labels for only some of the tweets is not a problem.

To tell PyG which nodes have labels and should be used for training, you can define a train_mask. Nodes with missing labels still technically need a y-value, which can be set to -1; as long as the loss is computed only on the masked nodes, this placeholder never acts as a third class.
