11

How can I convert my own dataset to be usable by PyTorch Geometric for a graph neural network?

All the tutorials use existing datasets that are already converted to be usable by PyTorch Geometric. For example, if I have my own point cloud dataset, how can I use it to train a graph neural network for classification? What about my own image dataset for classification?

Sparky05
Worthless Fella

4 Answers

3

How you need to transform your data depends on what format your model expects.

Graph neural networks typically expect (a subset of):

  • node features
  • edges
  • edge attributes
  • node targets

depending on the problem. In PyTorch Geometric you can bundle tensors holding these values into a Data object (and attach further attributes as you need) like so:

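import torch
from torch_geometric.data import Data
# x: [num_nodes, num_features]; edge_index: [2, num_edges], dtype long
data = Data(x=x, edge_index=edge_index, y=y)
data.train_idx = torch.tensor([...], dtype=torch.long)  # extra attributes are allowed
data.test_mask = torch.tensor([...], dtype=torch.bool)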
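For the image case raised in the question, one common (but by no means the only) choice is to treat every pixel as a node and connect neighbouring pixels. The sketch below builds the two arrays under that assumption; `image_to_grid_graph` is a hypothetical helper, and it stays in NumPy so it runs without PyTorch installed. The results can be converted with torch.from_numpy and passed to Data as x and edge_index.

```python
import numpy as np

def image_to_grid_graph(image):
    """Turn an (H, W) or (H, W, C) image into node features and a 4-neighbour edge list."""
    h, w = image.shape[:2]
    # One node per pixel; the channel values are the node features.
    x = image.reshape(h * w, -1).astype(np.float32)
    # Node id of pixel (r, c) is r * w + c.
    ids = np.arange(h * w).reshape(h, w)
    # Pairs of horizontally and vertically adjacent pixels.
    right = np.stack([ids[:, :-1].ravel(), ids[:, 1:].ravel()])
    down = np.stack([ids[:-1, :].ravel(), ids[1:, :].ravel()])
    edges = np.concatenate([right, down], axis=1)
    # Store both directions so the graph is undirected.
    edge_index = np.concatenate([edges, edges[::-1]], axis=1).astype(np.int64)
    return x, edge_index

image = np.random.rand(4, 5, 3)  # stand-in for a real image
x, edge_index = image_to_grid_graph(image)
print(x.shape, edge_index.shape)
```

The class label of the image would go into y, giving one Data object per image for graph-level classification.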
iacob
3

As mentioned in the PyTorch Geometric documentation:

Do I really need to use these dataset interfaces? No! Just as in regular PyTorch, you do not have to use datasets, e.g., when you want to create synthetic data on the fly without saving them explicitly to disk. In this case, simply pass a regular python list holding torch_geometric.data.Data objects and pass them to torch_geometric.loader.DataLoader:

from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

data_list = [Data(...), ..., Data(...)]
loader = DataLoader(data_list, batch_size=32)
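Conceptually, the loader merges the graphs of each mini-batch into one large disconnected graph: node features are concatenated, edge indices are shifted by the number of preceding nodes, and a batch vector records which graph each node belongs to. The NumPy sketch below only illustrates that idea; `collate_graphs` is a hypothetical helper, not the PyG implementation.

```python
import numpy as np

def collate_graphs(graphs):
    """Merge a list of (x, edge_index) pairs into one disconnected graph."""
    xs, edge_indices, batch = [], [], []
    offset = 0
    for graph_id, (x, edge_index) in enumerate(graphs):
        xs.append(x)
        # Shift node ids so they point into the concatenated feature matrix.
        edge_indices.append(edge_index + offset)
        # Record which graph each node came from.
        batch.append(np.full(x.shape[0], graph_id, dtype=np.int64))
        offset += x.shape[0]
    return (np.concatenate(xs),
            np.concatenate(edge_indices, axis=1),
            np.concatenate(batch))

# Two toy graphs: a 3-node path and a single 2-node edge.
g1 = (np.zeros((3, 2), dtype=np.float32), np.array([[0, 1], [1, 2]]))
g2 = (np.ones((2, 2), dtype=np.float32), np.array([[0], [1]]))
x, edge_index, batch = collate_graphs([g1, g2])
print(edge_index)
print(batch)
```

This is why a list of Data objects with different numbers of nodes can still be batched: nothing has to be padded to a common size.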
0
import os

import torch
from torch_geometric.data import Dataset, Data


class MyCustomDataset(Dataset):
    def __init__(self, root, filenames, transform=None, pre_transform=None):
        # List of raw file names (in your case point clouds), expected in <root>/raw.
        # Set it before super().__init__(), which may already trigger process().
        self.filenames = filenames
        super().__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self):
        return self.filenames

    @property
    def processed_file_names(self):
        """Files expected in processed_dir; if all are found, processing is skipped."""
        return [f"data_{i}.pt" for i in range(len(self.filenames))]

    def download(self):
        pass

    def process(self):
        for idx, path in enumerate(self.raw_paths):
            self._process_one_step(idx, path)

    def _process_one_step(self, idx, path):
        out_path = os.path.join(self.processed_dir, f"data_{idx}.pt")
        # Read your point cloud from `path` here and convert it to a Data object.
        # node_features, edge_index, edge_attr and label are placeholders you must build.
        data = Data(x=node_features,
                    edge_index=edge_index,
                    edge_attr=edge_attr,
                    y=label)  # you can add more arguments as you like
        torch.save(data, out_path)

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        path = os.path.join(self.processed_dir, self.processed_file_names[idx])
        # weights_only=False because a Data object is not a plain tensor
        # (PyTorch >= 2.6 defaults to weights_only=True).
        return torch.load(path, weights_only=False)

This will create data in the right format. Then you can use torch_geometric.loader.DataLoader to create a dataloader and train your network.
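The conversion step left as a comment in _process_one_step is where a point cloud becomes a graph. A frequent choice is to connect each point to its k nearest neighbours; the sketch below does this in plain NumPy under that assumption. `knn_edge_index` is a hypothetical helper, and the dense distance matrix makes it suitable only for small clouds. PyG also ships a KNNGraph transform that can do this for you on a Data object with a pos attribute.

```python
import numpy as np

def knn_edge_index(points, k):
    """Return a [2, N * k] edge index linking every point to its k nearest neighbours."""
    # Pairwise squared Euclidean distances, shape [N, N].
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    # A point must not be its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest points for every row.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Edges run from each neighbour (source) to the centre point (target).
    targets = np.repeat(np.arange(points.shape[0]), k)
    sources = neighbours.ravel()
    return np.stack([sources, targets]).astype(np.int64)

points = np.random.rand(100, 3).astype(np.float32)  # stand-in for a loaded point cloud
edge_index = knn_edge_index(points, k=6)
print(edge_index.shape)
```

The coordinates themselves can serve as node features (x) or positions (pos) once converted with torch.from_numpy, and the class label of the cloud goes into y.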

Harsh
  • Can I ask, what if you don't have the data as a set of files? I asked this question here: https://stackoverflow.com/questions/72571841/pytorch-data-loader-getattr-attribute-name-must-be-string and I feel like the discussion here kind of helps me, but I can't fully see the link between the two. – Slowat_Kela Jun 10 '22 at 10:13
0
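import numpy as np
import torch
from torch_geometric.data import Data

# Assumed: graph_df has one row per edge with integer "source"/"target" node indices,
# and embedding_df has one row per node with its feature vector in "vectors".
edge_index = torch.from_numpy(graph_df[["source", "target"]].to_numpy()).long()
x = torch.from_numpy(np.array(embedding_df["vectors"].tolist())).float()

# PyG expects edge_index with shape [2, num_edges], hence the transpose.
data = Data(x=x, edge_index=edge_index.t().contiguous())
data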

You can create graph data like this, for example starting from pandas DataFrames that hold the edge list and the node embeddings.
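One thing to watch with this approach: edge_index has to contain integer positions 0..N-1 that match the row order of x. If the source and target columns hold arbitrary labels (strings, database ids), they need to be remapped first. A minimal sketch, assuming a list of node labels in the same order as the feature rows; `remap_edges` is a hypothetical helper:

```python
import numpy as np

def remap_edges(sources, targets, node_ids):
    """Map arbitrary node labels to row positions in the feature matrix."""
    # node_ids[i] is the label of the node whose features sit in row i of x.
    lookup = {label: row for row, label in enumerate(node_ids)}
    src = [lookup[s] for s in sources]
    dst = [lookup[t] for t in targets]
    # Shape [2, num_edges], ready for torch.from_numpy.
    return np.array([src, dst], dtype=np.int64)

node_ids = ["a", "b", "c"]
edge_index = remap_edges(["a", "c"], ["b", "a"], node_ids)
print(edge_index)
```

A label that appears in an edge but not in node_ids raises a KeyError here, which is usually preferable to silently building a graph with misaligned features.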