
Time-series data have the dimensions below:

  • C: channel
  • L: length of time (sequence)

In my case, L is over 200,000 and C is over 200.

In the dataloader, batches are produced as:

    def __getitems__(self, idx):
        return data  # data = (B, C, W); B: batch size, W: sliding window size
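For context, here is a minimal runnable sketch of this interface with toy shapes (ToyWindows is a hypothetical name, not my real dataset), assuming a recent PyTorch (2.x) where the DataLoader passes the whole list of batch indices to __getitems__ when a map-style dataset defines it:

    import torch
    from torch.utils.data import Dataset, DataLoader

    class ToyWindows(Dataset):  # hypothetical toy dataset
        def __init__(self, C=4, L=100, W=8):
            self.x = torch.arange(C * L, dtype=torch.float32).reshape(C, L)
            self.W = W

        def __len__(self):
            return self.x.shape[1] - self.W + 1  # one item per valid window start

        def __getitems__(self, idx):  # idx: a list of B window start positions
            return torch.stack([self.x[:, i:i + self.W] for i in idx])  # (B, C, W)

    loader = DataLoader(ToyWindows(), batch_size=16, shuffle=True,
                        collate_fn=lambda batch: batch)  # batch is already stacked
    data = next(iter(loader))
    assert data.shape == (16, 4, 8)  # (B, C, W)

The identity collate_fn keeps default_collate from re-stacking the tensor that __getitems__ already batched.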

The original data size is approximately

2*10^5 * 200 * 4 bytes (float32) ≈ 1.6*10^8 bytes ≈ 160 MB.

But I need a large sliding window, such as over 512, and then the data size with the rolling window materialized is

2*10^5 * 200 * 512 * 4 bytes (float32) ≈ 8.2*10^10 bytes ≈ 82 GB.
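As a quick sanity check of that arithmetic (the number of windows is roughly L when L >> W, and float32 is 4 bytes):

    # back-of-envelope sizes, using the numbers from above
    L, C, W = 2 * 10**5, 200, 512
    raw = L * C * 4            # ~1.6e8 bytes  -> ~160 MB
    windowed = L * C * W * 4   # ~8.2e10 bytes -> ~82 GB once materialized
    print(f"raw: {raw / 1e6:.0f} MB, windowed: {windowed / 1e9:.0f} GB")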

There are several ways to prepare the data:

  1. Save and Load
  • split the data into small chunks and build rolling-window data per chunk (to_roll_window is sketched after this list):

      # chunk the series so each chunk of windowed float32 data stays near 1 GB (4 bytes per value)
      div = max(1, d1*d2*self.ls*4 // (1024*1024*1024))
      len_chunk = d1//div //self.save_batch * self.save_batch
      setL = 0
      for k in tqdm(range(div), desc=f"data gen... idx={i+1}/{len(self.set_info)}"):
          # overlap chunks by one window length so no window is lost at the boundary
          data_split = data[k*len_chunk:(k+1)*len_chunk+self.ls]
          # to_roll_window is sketched after this list
          div_data = torch.Tensor(to_roll_window(data_split, self.ls).astype(np.float32))
          # trim so the chunk splits evenly into save batches
          div_data = div_data[:div_data.shape[0]//self.save_batch*self.save_batch]
          L, self.C, self.T = div_data.shape
          for j, idx in enumerate(range(0, L, self.save_batch)):
              # save each batch of windows into storage as a separate .npy file
              fname = f'{i}_{j+setL:04d}.npy'
              np.save(os.path.join(self.data_basepath, fname), div_data[idx:idx+self.save_batch])
    
  • but this approach needs a lot of disk-reading time during training

  2. Make the data at every __getitems__() call in the dataloader

    def __getitems__(self, idx):
        # self.data has shape (L, C); self.data_transposed is the transposed data, (C, L)
        d = torch.empty([B, C, W])
        for j, i in enumerate(idx):
            d[j] = self.data_transposed[:, i:i+W]
        return d
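For reference, to_roll_window in the first snippet is not defined above; here is a minimal sketch of what it is assumed to do, using NumPy's sliding_window_view (the returned view is copy-free; the data are only duplicated when the view is materialized, e.g. by astype or np.save):

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def to_roll_window(x, window):
        # Assumed behavior: x has shape (L, C); return rolling windows of
        # shape (L - window + 1, C, window), matching the (N, C, T) unpack
        # in the chunked save loop above. The view itself costs no extra memory.
        return sliding_window_view(x, window, axis=0)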

I think the second way is more efficient, but building each batch on the fly also consumes a lot of time...

Is there any fancy way to solve this problem?

SangGyu