
I have one graph, defined by 4 matrices: x (node features), y (node labels), edge_index (edge list) and edge_attr (edge features). I want to create a dataset in PyTorch Geometric from this single graph and perform node-level classification. Simply wrapping these 4 matrices in a Data object does not seem to be enough.

I have created a dataset containing the attributes:

Data(edge_attr=[3339730, 1], edge_index=[2, 3339730], x=[6911, 50000], y=[6911, 1])

representing a graph. If I try to slice this graph, like:

train_dataset, test_dataset = dataset[:5000], dataset[5000:]

I get the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-feb278180c99> in <module>
      3 # train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
      4 
----> 5 train_dataset, test_dataset = dataset[:5000], dataset[5000:]
      6 
      7 # Create dataloader for training and test dataset.

~/anaconda3/envs/py38/lib/python3.8/site-packages/torch_geometric/data/data.py in __getitem__(self, key)
     92     def __getitem__(self, key):
     93         r"""Gets the data of the attribute :obj:`key`."""
---> 94         return getattr(self, key, None)
     95 
     96     def __setitem__(self, key, value):

TypeError: getattr(): attribute name must be string

What am I doing wrong in the data construction?

Qubix

2 Answers


For node classification:

Create a custom dataset.

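import random
from math import floor

import pandas as pd
import torch
from torch_geometric.data import Data, InMemoryDataset


class CustomDataset(InMemoryDataset):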
    def __init__(self, root, transform=None, pre_transform=None):
        super(CustomDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])
        
    @property
    def raw_file_names(self):
        return ['edge_list.csv', 'x.pt', 'y.pt', 'edge_attributes.csv']

    @property
    def processed_file_names(self):
        return ['graph.pt']

    def process(self):
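        # edge_list.csv: column 0 holds the target node, column 1 the source node.
        edge_list = pd.read_csv(self.raw_paths[0], dtype=int)
        target_nodes = edge_list.iloc[:,0].values
        source_nodes = edge_list.iloc[:,1].values
        # edge_index has shape [2, num_edges]: row 0 = sources, row 1 = targets.
        edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.int64)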

        x = torch.load(self.raw_paths[1], map_location=torch.device('cpu'))
        y = torch.load(self.raw_paths[2], map_location=torch.device('cpu'))

        # Make boolean node masks: a random 10% of the nodes for training, the rest for testing.
        n = x.shape[0]
        randomassort = list(range(n))
        random.shuffle(randomassort)
        max_train = floor(len(randomassort) * .1)
        train_mask_idx = torch.tensor(randomassort[:max_train])
        test_mask_idx = torch.tensor(randomassort[max_train:])
        train_mask = torch.zeros(n)
        test_mask = torch.zeros(n)
        # Set the selected positions to 1, then convert the 0/1 vectors to bool.
        train_mask.scatter_(0, train_mask_idx, 1)
        test_mask.scatter_(0, test_mask_idx, 1)
        train_mask = train_mask.type(torch.bool)
        test_mask = test_mask.type(torch.bool)

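        # Edge features, assumed to have one row per edge in the same order as edge_list.csv.
        edge_attributes = pd.read_csv(self.raw_paths[3])
        edge_attr = torch.tensor(edge_attributes.values, dtype=torch.float)

        data = Data(edge_index=edge_index, x=x, y=y, edge_attr=edge_attr, train_mask=train_mask, test_mask=test_mask)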

        print(data.__dict__)
        # Store the single graph in the (data, slices) format that __init__ loads back.
        data, slices = self.collate([data])
        torch.save((data, slices), self.processed_paths[0])
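
The mask construction in process() can also be sketched on its own with plain torch. This is only a minimal sketch, not the code from the answer: the helper name make_split_masks, its train_fraction argument and the fixed seed are illustrative assumptions, and boolean index assignment is swapped in for the scatter_ calls.

```python
import torch


def make_split_masks(num_nodes, train_fraction=0.1, seed=0):
    """Return disjoint boolean train/test masks over `num_nodes` nodes.

    Hypothetical helper for illustration; not part of PyTorch Geometric.
    """
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=generator)  # random node order
    num_train = int(num_nodes * train_fraction)

    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    test_mask = torch.zeros(num_nodes, dtype=torch.bool)
    train_mask[perm[:num_train]] = True  # first slice of the permutation
    test_mask[perm[num_train:]] = True   # every remaining node
    return train_mask, test_mask


# 6911 nodes, as in the Data object from the question.
train_mask, test_mask = make_split_masks(6911)
```

Every node ends up in exactly one of the two masks, so the graph itself stays whole and only the loss and the evaluation are restricted to a subset of nodes.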

Then, in the training loop, use the masks when computing the loss, so that only the training nodes update the model.

def train():
    ...
    model.train()
    optimizer.zero_grad()
    # The model scores all nodes; only the training nodes contribute to the loss.
    F.nll_loss(model()[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()
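
As a rough, self-contained illustration of a masked training step, the sketch below uses plain torch only. Everything in it is an assumption made for the example: the toy sizes, the random features and labels, and above all the single linear layer, which stands in for a real GNN that would also take edge_index.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for data.x, data.y and the masks (sizes are illustrative).
num_nodes, num_features, num_classes = 100, 16, 3
x = torch.randn(num_nodes, num_features)
y = torch.randint(0, num_classes, (num_nodes,))  # shape [num_nodes], dtype int64
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:10] = True
test_mask = ~train_mask

# Placeholder model: one linear layer instead of a GNN.
model = torch.nn.Linear(num_features, num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


def train_step():
    model.train()
    optimizer.zero_grad()
    out = F.log_softmax(model(x), dim=1)  # log-probabilities for every node
    loss = F.nll_loss(out[train_mask], y[train_mask])  # training nodes only
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def test_accuracy():
    model.eval()
    pred = model(x).argmax(dim=1)
    return (pred[test_mask] == y[test_mask]).float().mean().item()


loss = train_step()
acc = test_accuracy()
```

Note that F.nll_loss expects a 1-D tensor of integer class indices as its target. With labels shaped [6911, 1] as in the question, something like y.squeeze(-1) would probably be needed before indexing with the mask.
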
conv3d
  • What is "ch" in train_mask.scatter_(0, ch, 1); test_mask.scatter_(0, ch, 1) ? – Qubix Jan 17 '21 at 12:05
  • @Qubix whoops I think I renamed some variables when I moved the code over. Just edited. It should be the `idx`'s of the train and test masks – conv3d Jan 18 '21 at 19:01
  • There's a new function in utils that removes some of the boilerplate from this code, automatically converts mask indexes into the mask itself, skipping the scatter and astype steps: https://pytorch-geometric.readthedocs.io/en/latest/modules/utils.html?highlight=mask#torch_geometric.utils.index_to_mask – Zhi Yong Lee Jan 15 '23 at 04:32

You cannot slice a torch_geometric.data.Data as its __getitem__ is defined as:

def __getitem__(self, key):
    r"""Gets the data of the attribute :obj:`key`."""
    return getattr(self, key, None)
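
The effect of that definition can be reproduced without PyTorch Geometric. The class below is a hypothetical stand-in (TinyData is not a library class) that copies only the quoted __getitem__, to show that a string key works while a slice raises the TypeError from the question.

```python
class TinyData:
    """Hypothetical stand-in that copies only the quoted __getitem__."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def __getitem__(self, key):
        # Same lookup as in the quoted source: the key is an attribute name.
        return getattr(self, key, None)


graph = TinyData(x=[[0.0], [1.0], [2.0]], y=[0, 1, 0])

by_name = graph["y"]  # a string key returns the attribute of that name

try:
    graph[:2]  # a slice object is not a valid attribute name
    slice_error = None
except TypeError as exc:
    slice_error = exc
```

Here graph["y"] returns the stored labels, while graph[:2] fails inside getattr before any node is ever indexed.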

So __getitem__ only looks up attributes by name; it cannot be used to index nodes or edges. However, if an edge-level split works for what you are trying to do, you could use torch_geometric.utils.train_test_split_edges. Something like:

torch_geometric.utils.train_test_split_edges(dataset, val_ratio=0.1, test_ratio=0)

It will:

split the edges of your Data object into positive and negative train/val/test edges, and add the following attributes: train_pos_edge_index, train_neg_adj_mask, val_pos_edge_index, val_neg_edge_index, test_pos_edge_index, and test_neg_edge_index to the returned Data object.

Ivan
  • Hi, and thank you for answering. I've posted this question all over the internet, as it's rather urgent :( and I've received 1 single answer so far. Yours. I am looking at that function, but I don't know how I could incorporate it into my data object. Would it be possible for you to post a simple example that shows how that is done, if you are given just one single graph, in the form of one data object: Data(edge_attr=[3339730, 1], edge_index=[2, 3339730], x=[6911, 50000], y=[6911, 1]) – Qubix Jan 11 '21 at 18:42
  • You only want to use `train_test_split_edges` if you want to do your train-test split on the *edges* and not the *nodes*. To train-test split on nodes you want to use *masks* – conv3d Jan 11 '21 at 19:55
  • @jchaykow any idea how that is done, could you please post a minimal working example with a small graph? – Qubix Jan 11 '21 at 20:18