I am currently working on a side project that attempts to predict a numerical response from a set of features encoded as time series data. Here is my sample code:
import collections.abc

import numpy as np
import torch
import torch.utils.data as data
import torch.nn as nn


def get_max_length(x):
    # Length of the longest sequence in the batch
    return len(max(x, key=len))


def pad_sequence(seq):
    # Left-pad every sequence with zeros up to the longest length in the batch
    max_len = get_max_length(seq)

    def _pad(_it, _max_len):
        return [0] * (_max_len - len(_it)) + _it

    return [_pad(it, max_len) for it in seq]


def custom_collate(batch):
    # batch is a list of (X, y) pairs; transpose it into (all X, all y)
    transposed = zip(*batch)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.DoubleTensor(pad_sequence(samples)))
    return lst


class SampleDataLoader(data.Dataset):
    def __init__(self, train = True):
        self.X, self.y = [], []
        self.seed, self.test_size = 0, 0.2
        self.train = train
        self.num_experiments = 1000
        self.num_features_per_experiment = 3
        np.random.seed(self.seed)
        for _ in range(self.num_experiments):
            # Randomly assign each experiment to the train or validation split
            _id = np.random.rand()
            if self.train and _id <= self.test_size:
                continue
            if not self.train and _id > self.test_size:
                continue
            num_timesteps_per_experiment = np.random.randint(5, 200)
            response = np.random.randint(1, 100)
            # Note: [0] keeps only the first of the 3 feature rows,
            # so each sample is a 1-D list of length num_timesteps
            self.X.append(np.random.rand(self.num_features_per_experiment, num_timesteps_per_experiment).tolist()[0])
            self.y.append(response)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


train_dataset, batch_size = SampleDataLoader(train = True), 16
train_dataloader = data.DataLoader(dataset = train_dataset,
                                   batch_size = batch_size,
                                   collate_fn = custom_collate,
                                   shuffle = True)
I defined SampleDataLoader, which can generate both the training and validation sets for my model by passing train = True or train = False, respectively, when instantiating the class. There are 1000 experiments captured in this dataset object. In each experiment, 3 features are captured across a variable number of timesteps, between 5 and 199 (np.random.randint excludes the upper bound). My goal is to create a model using RNNs or seq2seq models to predict the response value, an integer between 1 and 99.
I am having some trouble adapting these tutorials (https://github.com/bentrevett/pytorch-seq2seq) to the SampleDataLoader object that I created. I have already addressed the first issue, variable input size, by padding the inputs in each batch with zeros.
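For reference, zero-padding alone makes the batch stackable, but an RNN would still step through the padded positions unless it also knows the true lengths. Below is a rough sketch of a collate function that returns those lengths alongside the padded batch. The name collate_with_lengths is hypothetical, it assumes each sample is a 1-D list of floats paired with a numeric target, and unlike my code above it pads on the right (the default of torch.nn.utils.rnn.pad_sequence) rather than on the left.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_lengths(batch):
    """Hypothetical collate_fn: right-pads sequences and keeps their true lengths."""
    sequences, targets = zip(*batch)
    # Assumption: each sample is a 1-D list of floats (one value per timestep)
    tensors = [torch.tensor(seq, dtype=torch.float32) for seq in sequences]
    lengths = torch.tensor([len(t) for t in tensors], dtype=torch.long)
    # Pads with zeros on the right; result has shape (batch, max_len)
    padded = pad_sequence(tensors, batch_first=True, padding_value=0.0)
    targets = torch.tensor(targets, dtype=torch.float32)
    return padded, lengths, targets


# Tiny usage example with two sequences of different lengths
batch = [([0.1, 0.2, 0.3], 7), ([0.4], 2)]
padded, lengths, targets = collate_with_lengths(batch)
```

The lengths tensor is what a later step such as pack_padded_sequence would need in order to skip the padded timesteps.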
Any help would be greatly appreciated! Thanks!
EDIT: To clarify, I am trying to predict a sequence using seq2seq and then generate the final prediction either by summing the outputs or by adding another model (perhaps a fully connected layer) on top. The logic behind this approach is inspired by this post: https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html.
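To make the idea concrete, here is a minimal sketch of the "RNN plus FC layer" variant, not a full seq2seq model. It is only an illustration under assumptions: the class name SequenceRegressor, the hidden size, and the choice of a GRU are all hypothetical, the input is assumed to be right-padded with shape (batch, max_len, num_features), and the prediction is read from the final hidden state rather than from a sum over outputs.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class SequenceRegressor(nn.Module):
    """Hypothetical model: GRU encoder followed by a linear regression head."""

    def __init__(self, num_features=1, hidden_size=32):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, padded, lengths):
        # padded: (batch, max_len, num_features), right-padded; lengths: (batch,)
        packed = pack_padded_sequence(
            padded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        # hidden holds the state at each sequence's last real timestep
        _, hidden = self.gru(packed)  # (1, batch, hidden_size)
        return self.head(hidden[-1]).squeeze(-1)  # (batch,)


# Usage with random stand-in data (not real experiment values)
torch.manual_seed(0)
model = SequenceRegressor()
padded = torch.rand(4, 10, 1)
lengths = torch.tensor([10, 7, 3, 5])
targets = torch.tensor([12.0, 55.0, 3.0, 80.0])
predictions = model(padded, lengths)
loss = nn.functional.mse_loss(predictions, targets)
loss.backward()
```

Because the sequences are packed, whatever values sit in the padded positions should not influence the prediction, which is the property I want from the variable-length handling.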
Thanks!