Time-series data has the following dimensions:

- C: number of channels
- L: length of time (sequence length)

In my case, L is over 200,000 and C is over 200.
In the dataloader, batches are produced like this:

```python
def __getitems__(self, idx):
    ...
    return data  # data shape = (B, C, W); B: batch size, W: sliding-window size
```
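For completeness, this is how I assume it hooks into a `DataLoader` (recent PyTorch calls `__getitems__` with the whole list of batch indices when a dataset defines it, and then still applies `collate_fn` to the result, hence the identity collate):

```python
from torch.utils.data import DataLoader

# __getitems__ already returns a stacked (B, C, W) tensor, so an identity
# collate_fn stops the DataLoader from re-collating the batch
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    collate_fn=lambda batch: batch)
```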
The original data size is approximately 2 * 10^5 * 200 * 4 bytes (float32) ≈ 160 MB. But I need a large sliding window (over 512), so the rolling-window data grows to about 2 * 10^5 * 200 * 512 * 4 bytes ≈ 82 GB.
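A quick sanity check of those figures (plain arithmetic from the shapes above):

```python
L, C, W = 200_000, 200, 512  # sequence length, channels, window size
FLOAT32 = 4                  # bytes per element

raw = L * C * FLOAT32                     # the original (L, C) array
windowed = (L - W + 1) * C * W * FLOAT32  # one (C, W) window per start index

print(f"raw:      {raw / 1e6:.0f} MB")       # -> 160 MB
print(f"windowed: {windowed / 1e9:.1f} GB")  # -> 81.7 GB
```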
There are several ways to prepare the data.

- Save and load

  Split the data into small chunks, expand each chunk into its rolling-window form, and save the chunks to disk:
```python
import os
import numpy as np
import torch
from tqdm import tqdm

# d1, d2 = data.shape; self.ls is the window length
# split into ~1 GB chunks (float32 = 4 bytes per element)
div = d1 * d2 * self.ls * 4 // (1024 * 1024 * 1024)
len_chunk = d1 // div // self.save_batch * self.save_batch
setL = 0
for k in tqdm(range(div), desc=f"data gen... idx={i+1}/{len(self.set_info)}"):
    # overlap consecutive chunks by one window so no window is lost at the seam
    data_split = data[k * len_chunk : (k + 1) * len_chunk + self.ls]
    div_data = torch.from_numpy(to_roll_window(data_split, self.ls).astype(np.float32))
    # trim so the chunk divides evenly into save batches
    div_data = div_data[: div_data.shape[0] // self.save_batch * self.save_batch]
    L, self.C, self.T = div_data.shape
    for j, idx in enumerate(range(0, L, self.save_batch)):
        # save one batch of windows per file
        fname = f"{i}_{j + setL:04d}.npy"
        np.save(os.path.join(self.data_basepath, fname), div_data[idx : idx + self.save_batch])
    setL += L // self.save_batch  # advance the file counter so chunks don't overwrite each other
```
  But this approach needs a lot of reading time.
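For context, the loading side this implies looks roughly like the following. This is only a sketch: the single set index `i = 0` in the file name is an assumption, and `mmap_mode='r'` is one way to avoid reading a whole file for a single window:

```python
import os
import numpy as np
import torch

class ChunkedWindowDataset(torch.utils.data.Dataset):
    """Hypothetical reader for the .npy files written above."""

    def __init__(self, data_basepath: str, save_batch: int, n_files: int):
        self.data_basepath = data_basepath
        self.save_batch = save_batch  # windows per file
        self.n_files = n_files

    def __len__(self):
        return self.n_files * self.save_batch

    def __getitem__(self, idx):
        file_idx, offset = divmod(idx, self.save_batch)
        fname = f"0_{file_idx:04d}.npy"  # assumes a single set index i = 0
        # the memory map avoids reading the whole file for one window
        arr = np.load(os.path.join(self.data_basepath, fname), mmap_mode="r")
        return torch.from_numpy(np.array(arr[offset]))  # copy one (C, W) window
```

Even with the memory map, random access touches many different files per batch, which is where the read time goes.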
- Make the data on the fly at every `__getitems__()` call in the dataloader

```python
def __getitems__(self, idx):
    # self.data has shape (L, C); self.data_transposed is the (C, L) transpose
    d = torch.empty(len(idx), self.C, self.W)
    for j, i in enumerate(idx):
        d[j] = self.data_transposed[:, i : i + self.W]
    return d
```
I think the second way is more efficient, but it still takes a lot of time...
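One way to cut that cost is to build a zero-copy sliding-window view once, so each fetch copies only a single (C, W) window. A minimal sketch (the class and its wiring are illustrative, not my actual code), relying on `torch.Tensor.unfold` returning a view rather than materializing the ~82 GB:

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Illustrative: serves (C, W) windows from an (L, C) float32 tensor."""

    def __init__(self, data: torch.Tensor, window: int):
        # (L, C) -> (C, L) -> (C, L - W + 1, W); unfold returns a view,
        # so the full windowed tensor never exists in memory
        self.windows = data.t().contiguous().unfold(1, window, 1)

    def __len__(self):
        return self.windows.shape[1]

    def __getitem__(self, i):
        # copies exactly one (C, W) window out of the shared view
        return self.windows[:, i, :].clone()
```

A standard `DataLoader` with `num_workers > 0` then stacks these into (B, C, W) batches, copying only about C * W * 4 bytes per sample.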
Is there any fancy way to solve this problem?