Time-series data has the following dimensions:

- C: number of channels
- L: length of time (sequence length)

In my case, L is over 200,000 and C is over 200.
In the dataloader, batches are produced like this:

```python
def __getitems__(self, idx):
    ...
    return data  # data shape = (B, C, W); B: batch size, W: sliding-window size
```
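For completeness, this is how I assume it hooks into a `DataLoader` (recent PyTorch calls `__getitems__` with the whole list of batch indices when a dataset defines it, and then still applies `collate_fn` to the result, hence the identity collate):

```python
from torch.utils.data import DataLoader

# __getitems__ already returns a stacked (B, C, W) tensor, so an identity
# collate_fn stops the DataLoader from re-collating the batch
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    collate_fn=lambda batch: batch)
```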
The original data size is approximately 2 * 10^5 * 200 * 4 bytes (float32) ≈ 160 MB. But I need a large sliding window (over 512), so the rolling-window data grows to about 2 * 10^5 * 200 * 512 * 4 bytes ≈ 82 GB.
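A quick sanity check of those figures (plain arithmetic from the shapes above):

```python
L, C, W = 200_000, 200, 512  # sequence length, channels, window size
FLOAT32 = 4                  # bytes per element

raw = L * C * FLOAT32                     # the original (L, C) array
windowed = (L - W + 1) * C * W * FLOAT32  # one (C, W) window per start index

print(f"raw:      {raw / 1e6:.0f} MB")       # -> 160 MB
print(f"windowed: {windowed / 1e9:.1f} GB")  # -> 81.7 GB
```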
There are several ways to prepare the data.

- Save and load

  Split the data into small chunks, expand each chunk into its rolling-window form, and save the chunks to disk:
```python
import os
import numpy as np
import torch
from tqdm import tqdm

# d1, d2 = data.shape; self.ls is the window length
# split into ~1 GB chunks (float32 = 4 bytes per element)
div = d1 * d2 * self.ls * 4 // (1024 * 1024 * 1024)
len_chunk = d1 // div // self.save_batch * self.save_batch
setL = 0
for k in tqdm(range(div), desc=f"data gen... idx={i+1}/{len(self.set_info)}"):
    # overlap consecutive chunks by one window so no window is lost at the seam
    data_split = data[k * len_chunk : (k + 1) * len_chunk + self.ls]
    div_data = torch.from_numpy(to_roll_window(data_split, self.ls).astype(np.float32))
    # trim so the chunk divides evenly into save batches
    div_data = div_data[: div_data.shape[0] // self.save_batch * self.save_batch]
    L, self.C, self.T = div_data.shape
    for j, idx in enumerate(range(0, L, self.save_batch)):
        # save one batch of windows per file
        fname = f"{i}_{j + setL:04d}.npy"
        np.save(os.path.join(self.data_basepath, fname), div_data[idx : idx + self.save_batch])
    setL += L // self.save_batch  # advance the file counter so chunks don't overwrite each other
```
  But this approach needs a lot of reading time.
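For context, the loading side this implies looks roughly like the following. This is only a sketch: the single set index `i = 0` in the file name is an assumption, and `mmap_mode='r'` is one way to avoid reading a whole file for a single window:

```python
import os
import numpy as np
import torch

class ChunkedWindowDataset(torch.utils.data.Dataset):
    """Hypothetical reader for the .npy files written above."""

    def __init__(self, data_basepath: str, save_batch: int, n_files: int):
        self.data_basepath = data_basepath
        self.save_batch = save_batch  # windows per file
        self.n_files = n_files

    def __len__(self):
        return self.n_files * self.save_batch

    def __getitem__(self, idx):
        file_idx, offset = divmod(idx, self.save_batch)
        fname = f"0_{file_idx:04d}.npy"  # assumes a single set index i = 0
        # the memory map avoids reading the whole file for one window
        arr = np.load(os.path.join(self.data_basepath, fname), mmap_mode="r")
        return torch.from_numpy(np.array(arr[offset]))  # copy one (C, W) window
```

Even with the memory map, random access touches many different files per batch, which is where the read time goes.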
- Make the data on the fly at every `__getitems__()` call in the dataloader

```python
def __getitems__(self, idx):
    # self.data has shape (L, C); self.data_transposed is the (C, L) transpose
    d = torch.empty(len(idx), self.C, self.W)
    for j, i in enumerate(idx):
        d[j] = self.data_transposed[:, i : i + self.W]
    return d
```
I think the second way is more efficient, but it still takes a lot of time...
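One way to cut that cost is to build a zero-copy sliding-window view once, so each fetch copies only a single (C, W) window. A minimal sketch (the class and its wiring are illustrative, not my actual code), relying on `torch.Tensor.unfold` returning a view rather than materializing the ~82 GB:

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Illustrative: serves (C, W) windows from an (L, C) float32 tensor."""

    def __init__(self, data: torch.Tensor, window: int):
        # (L, C) -> (C, L) -> (C, L - W + 1, W); unfold returns a view,
        # so the full windowed tensor never exists in memory
        self.windows = data.t().contiguous().unfold(1, window, 1)

    def __len__(self):
        return self.windows.shape[1]

    def __getitem__(self, i):
        # copies exactly one (C, W) window out of the shared view
        return self.windows[:, i, :].clone()
```

A standard `DataLoader` with `num_workers > 0` then stacks these into (B, C, W) batches, copying only about C * W * 4 bytes per sample.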
Is there any fancy way to solve this problem?