
I am currently working on a side project that attempts to predict a numerical response from a set of features encoded as time series data. Here is the sample code:

import numpy as np
import torch
import torch.utils.data as data
import torch.nn as nn

import collections.abc

def get_max_length(x):
    return len(max(x, key=len))

def pad_sequence(seq):
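    # prepend zeros so every sequence in the batch reaches the length of the longest one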
    def _pad(_it, _max_len):
        return [0] * (_max_len - len(_it)) + _it
    return [_pad(it, get_max_length(seq)) for it in seq]

def custom_collate(batch):
    transposed = zip(*batch)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.DoubleTensor(pad_sequence(samples)))
    return lst

class SampleDataLoader(data.Dataset):
    def __init__(self, train = True):
        self.X, self.y  = [], []
        self.seed, self.test_size = 0, 0.2
        self.train = train
        self.num_experiments = 1000
        self.num_features_per_experiment = 3

        np.random.seed(self.seed)
        for _ in range(self.num_experiments):
            _id = np.random.rand()
            if self.train and _id <= self.test_size:
                continue
            if not self.train and _id > self.test_size:
                continue
            num_timesteps_per_experiment = np.random.randint(5, 200)
            response = np.random.randint(1, 100)
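            # NOTE: .tolist()[0] keeps only the first of the 3 feature rows, so each
            # sample ends up as a 1-D list of length num_timesteps_per_experiment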
            self.X.append(np.random.rand(self.num_features_per_experiment, num_timesteps_per_experiment).tolist()[0])
            self.y.append(response)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset, batch_size = SampleDataLoader(train = True), 16
train_dataloader = data.DataLoader(dataset = train_dataset,
                                   batch_size = batch_size,
                                   collate_fn = custom_collate,
                                   shuffle = True)

I defined SampleDataLoader, which can generate both the training and validation sets for my model by passing train = True or train = False when instantiating the class. There are 1000 experiments captured in this dataset. In each experiment, 3 features are recorded across a variable number of timesteps between 5 and 200. My goal is to create a model using RNNs or seq2seq models to predict a response value between 1 and 100.
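
For reference, this is roughly the baseline I have in mind before moving to a full seq2seq setup: run a single GRU over the padded batch and map its final hidden state to the response with a fully connected layer. The GRU, the hidden size of 64, the single input feature per timestep, and the casting to float32 are my own choices rather than anything from the tutorials; each batch from custom_collate unpacks as a padded DoubleTensor of shape (batch, max_len) plus a LongTensor of responses.

import torch
import torch.nn as nn

class RNNRegressor(nn.Module):
    # Baseline: run a GRU over the padded sequence and regress the response
    # from the final hidden state.
    def __init__(self, input_size = 1, hidden_size = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first = True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, max_len) padded DoubleTensor from custom_collate
        x = x.float().unsqueeze(-1)                 # -> (batch, max_len, 1)
        _, h_n = self.rnn(x)                        # h_n: (1, batch, hidden_size)
        return self.fc(h_n.squeeze(0)).squeeze(-1)  # -> (batch,)

model = RNNRegressor()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
for X_batch, y_batch in train_dataloader:
    optimizer.zero_grad()
    loss = criterion(model(X_batch), y_batch.float())
    loss.backward()
    optimizer.step()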

I am having some trouble integrating these tutorials (https://github.com/bentrevett/pytorch-seq2seq) with the SampleDataLoader object that I created. I have already addressed the first issue of variable input sizes by zero-padding the inputs within each batch.
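
One thing I am unsure about is whether the RNN should even see the padded timesteps. My understanding is that torch.nn.utils.rnn.pack_padded_sequence lets the RNN stop at each sequence's true length, but it expects the padding at the end and needs the original lengths, whereas my pad_sequence above prepends the zeros. Below is a sketch of an alternative collate function along those lines; the name collate_with_lengths and the float32 casting are mine, and I import torch's pad_sequence under the alias pad_to_longest so it does not clash with my own pad_sequence above.

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence as pad_to_longest

def collate_with_lengths(batch):
    # Pad at the *end* of each sequence and keep the true lengths so the RNN
    # can skip the padded timesteps via pack_padded_sequence.
    xs, ys = zip(*batch)
    lengths = torch.tensor([len(x) for x in xs])
    padded = pad_to_longest([torch.tensor(x, dtype = torch.float32) for x in xs],
                            batch_first = True)          # (batch, max_len)
    return padded, lengths, torch.tensor(ys, dtype = torch.float32)

# Inside a model's forward pass, with x: (batch, max_len) and lengths: (batch,):
# packed = pack_padded_sequence(x.unsqueeze(-1), lengths,
#                               batch_first = True, enforce_sorted = False)
# output, h_n = self.rnn(packed)  # h_n holds the state at each sequence's last real step

The DataLoader would then be constructed with collate_fn = collate_with_lengths instead of custom_collate.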

Any help would be greatly appreciated! Thanks!

EDIT: For added transparency, I am trying to predict a sequence using seq2seq, and then generate the final prediction either by summing the outputs or by adding another model (perhaps a fully connected layer) on top. I am referencing this article as the inspiration for my approach to this problem: https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html.
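
Concretely, by summing the outputs I mean something like the sketch below: run the RNN over the padded batch, mask out the padded positions, sum the per-timestep outputs, and map the sum to a scalar with a fully connected layer. This assumes a collate function that also returns the true lengths (like the collate_with_lengths sketch above); the masking via torch.arange is my own adaptation of the r2rt idea to PyTorch, not code from that article.

import torch
import torch.nn as nn

class SumOutputsRegressor(nn.Module):
    # Sum the RNN outputs over the valid (non-padded) timesteps, then map the
    # summed representation to a scalar response with a fully connected layer.
    def __init__(self, input_size = 1, hidden_size = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first = True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x, lengths):
        # x: (batch, max_len), zero-padded at the end; lengths: (batch,)
        out, _ = self.rnn(x.unsqueeze(-1))                # (batch, max_len, hidden_size)
        # mask is 1 for real timesteps and 0 for padding
        mask = (torch.arange(out.size(1), device = out.device)[None, :]
                < lengths[:, None].to(out.device)).float()
        summed = (out * mask.unsqueeze(-1)).sum(dim = 1)  # (batch, hidden_size)
        return self.fc(summed).squeeze(-1)                # (batch,)

As with the first sketch, I would train this against nn.MSELoss, calling model(X_batch, lengths) and comparing against the responses cast to float.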

Thanks!

Wilson
