What is an efficient way to make a dataset and dataloader for high frequency time series with multiple individuals?

Question

I'm trying to forecast high-frequency time series using LSTMs in PyTorch. Following the PyTorch tutorial on custom datasets and models, I wrote my own Dataset class and DataLoader. They work correctly, but generating a single batch takes too long.

I want to generate fixed-size batches in which each batch contains time series from different individuals, and the input window has the same length as the output window (multi-step prediction).

I think the slowdown comes from the check that verifies each window is valid.

My dataframe has a little more than 3M rows and 6 columns. There are about 100 individuals, and for each individual I have 4 different time series, $y_{1}$, $y_{2}$, $y_{3}$ and $y_{4}$. There are no missing values, the time steps are consecutive, and every individual has the same time steps.

My code is:

import numpy as np
import torch
from torch.utils.data import Dataset

class TSDataset(Dataset):
  
  def __init__(self, train_data, unique_column = 'unique_id', input_length = 3840, target_length = 3840, targets = ['y1', 'y2', 'y3', 'y4'], transform = None):
    self.train_data = train_data
    self.unique_column = unique_column
    self.input_length = input_length
    self.target_length = target_length
    self.total_window_length = input_length + target_length
    self.targets = targets
    
  def __len__(self):
    return len(self.train_data)

  def verify_time_steps(self, idx):
    # Return True when the window starting at idx is invalid and must be redrawn
    end = idx + self.total_window_length  # exclusive end of the window

    # The window must not run past the end of the data
    if end > len(self.train_data):
      return True

    # The window must not span more than one individual
    num_individuals = self.train_data.iloc[idx:end][self.unique_column].nunique()
    return num_individuals != 1

  def reshuffle(self):
    return np.random.randint(0, len(self.train_data))

  def __getitem__(self, idx):
    if torch.is_tensor(idx):
      idx = idx.tolist()

    # Redraw the start index until the window is valid
    while self.verify_time_steps(idx):
      idx = self.reshuffle()

    # Input window, followed immediately by the target window
    split = idx + self.input_length
    sample = self.train_data.iloc[idx:split][self.targets].values
    labels = self.train_data.iloc[split:split + self.target_length][self.targets].values

    sample = torch.from_numpy(sample)
    labels = torch.from_numpy(labels)

    return sample, labels
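
One direction that might avoid the per-item check is to compute every valid window start once, so that an item lookup is just two slices. The sketch below is only an illustration under stated assumptions: rows are sorted so that each individual's rows are contiguous and in time order, and the target columns fit in a single NumPy array. `build_window_starts` and `get_window` are hypothetical helper names, not part of PyTorch.

```python
import numpy as np

def build_window_starts(ids, total_window_length):
    """Return every row index where a full window stays inside one individual.

    Assumes `ids` is sorted so that each individual's rows are contiguous.
    """
    ids = np.asarray(ids)
    # Positions where the individual changes mark the block boundaries.
    boundaries = np.flatnonzero(ids[1:] != ids[:-1]) + 1
    block_starts = np.concatenate(([0], boundaries))
    block_ends = np.concatenate((boundaries, [len(ids)]))  # exclusive ends
    starts = [
        np.arange(s, e - total_window_length + 1)
        for s, e in zip(block_starts, block_ends)
        if e - s >= total_window_length  # skip blocks shorter than one window
    ]
    return np.concatenate(starts) if starts else np.empty(0, dtype=np.int64)

def get_window(values, starts, i, input_length, target_length):
    """Slice the i-th valid window into (input, target) without copying."""
    s = starts[i]
    split = s + input_length
    return values[s:split], values[split:split + target_length]

# Toy data (made up for illustration): 2 individuals, 10 rows each, 4 series.
ids = np.repeat([0, 1], 10)
values = np.arange(20 * 4, dtype=np.float32).reshape(20, 4)

starts = build_window_starts(ids, total_window_length=6)
sample, labels = get_window(values, starts, 0, input_length=3, target_length=3)
print(len(starts), sample.shape, labels.shape)
```

In a Dataset built this way, `__len__` would return `len(starts)` and `__getitem__` would wrap the two slices with `torch.from_numpy`, so no index ever has to be redrawn.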

I've tried using TimeSeriesDataSet from PyTorch Forecasting, but I had a hard time creating models that work with it.

I've also tried building the dataset ahead of time, outside the Dataset class, as a NumPy array, but it doesn't fit in my RAM.
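
A rough estimate shows why storing every window is so much larger than the raw values. The figures are assumptions based on the numbers above: exactly 100 individuals with 30,000 rows each (3M / 100), float64 values, and a stride of one row between consecutive windows.

```python
n_individuals = 100                                 # assumed exact
rows_per_individual = 3_000_000 // n_individuals    # assumed equal split
n_targets = 4
window = 3840 + 3840                                # input + target length
bytes_per_value = 8                                 # float64 assumed

# With a stride of one row, each individual yields this many full windows.
windows_per_individual = rows_per_individual - window + 1
n_windows = n_individuals * windows_per_individual

# Storing every window separately versus storing only the raw values.
materialized_bytes = n_windows * window * n_targets * bytes_per_value
raw_bytes = n_individuals * rows_per_individual * n_targets * bytes_per_value

print(f"{n_windows:,} windows")
print(f"materialized windows: {materialized_bytes / 1e9:.1f} GB")
print(f"raw values:           {raw_bytes / 1e6:.1f} MB")
```

Under these assumptions the materialized windows need hundreds of gigabytes, while the raw values need under 100 MB, so slicing windows on demand from one array looks like the more realistic option.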

I hope you can help me figure out how to reduce the computation time.
