
I have a dataset containing 100 samples, each with shape (5000, 2), so the initial dataset shape is (100, 5000, 2). (The numbers are made up to keep the example clear; the real dataset is much bigger.) Each sample is pre-processed with a function F, and each sample generates 100 new samples, so the final dataset shape will be (10000, 5000, 2) for the input (X) and (10000, 1) for the output (Y). The problem is that, due to RAM limitations, I cannot pre-process the whole dataset at once, and from what I've read it seems I should use tf.data.

The question I have is: at which step should I apply the F function? At first I tried dataset.map(), but I didn't succeed. Then I tried tf.data.Dataset.from_generator() with F as the generator, and now the problem is that an extra dimension is added to the dataset, so it becomes (1, 10000, 5000, 2) and (1, 10000, 1), as if the whole dataset were defined as a single sample. If anyone knows what I should do, I would appreciate it.

Note: in fact, each initial data sample doesn't have a label; the F function takes the raw samples and produces 100*n samples with associated labels: Initial_X -> F_function -> x, y

Here is the pseudocode:

import numpy as np
import tensorflow as tf

Initial_X = np.random.rand(100, 5000, 2)

def F_function(samples):
    # stand-in for the real pre-processing: every input sample yields 100 new samples
    x = np.random.rand(100 * samples.shape[0], samples.shape[1], samples.shape[2])
    y = np.arange(100 * samples.shape[0])[:, np.newaxis]  # dummy labels
    return x, y

def data_generator():
    # expands the WHOLE dataset in one call, then yields it as a single element
    x, y = F_function(Initial_X)
    yield (x, y)

def get_dataset():
    dataset = tf.data.Dataset.from_generator(
        generator=data_generator,
        output_types=(tf.float64, tf.float64)
    )

    dataset = dataset.batch(32)
    train_dataset = dataset.take(int(0.8 * 10000))
    test_dataset = dataset.skip(int(0.8 * 10000))

    return train_dataset, test_dataset

train_dataset, test_dataset = get_dataset()


for i, ex in enumerate(train_dataset):
    print(i, ex)

but this returns:

0 (<tf.Tensor: shape=(1, 10000, 5000, 2), dtype=float64, numpy=
array([[[[9.82932481e-01, 6.58260152e-02],
...,
[7.17173551e-03, 2.06494299e-01]]]])>, <tf.Tensor: shape=(1, 10000, 1), dtype=float64, numpy=
array([[[0.000e+00],
        ...,
        [9.999e+03]]])>)

I expected to get samples with shape (5000, 2) and their associated labels.

Update:

I added a dataset = dataset.unbatch() line as follows:

def get_dataset():
    dataset = tf.data.Dataset.from_generator(
        generator=data_generator,
        output_types=(tf.float64, tf.float64)
    )
    dataset = dataset.unbatch()

    dataset = dataset.batch(32)
    train_dataset = dataset.take(int(0.8 * 10000))
    test_dataset = dataset.skip(int(0.8 * 10000))

    return train_dataset, test_dataset

and the dataset shape problem was solved. However, I turned to .from_generator() to deal with the memory limitation and to pre-process the data with the F function in a streaming way, but it seems I was wrong, because I still have the memory issue. Any suggestions for dealing with this memory problem? Isn't using .from_generator() wrong for my case?
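
For reference, my current data_generator calls F_function(Initial_X) on the entire array, so the whole expanded dataset is materialized before anything is yielded. A per-sample version would look something like the sketch below (per_sample_generator is just an illustrative name, and I'm assuming F_function can be applied to a single sample), but I'm not sure whether this is the right approach:

def per_sample_generator():
    for sample in Initial_X:                      # sample shape: (5000, 2)
        x, y = F_function(sample[np.newaxis])     # x: (100, 5000, 2), y: (100, 1)
        for xi, yi in zip(x, y):                  # emit one example at a time
            yield xi, yi

streamed = tf.data.Dataset.from_generator(
    generator=per_sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(5000, 2), dtype=tf.float64),
        tf.TensorSpec(shape=(1,), dtype=tf.int64),  # labels come from np.arange, so int64
    ),
)
streamed = streamed.batch(32)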

SadeqK

1 Answer


Here is a solution:

import numpy as np
import tensorflow as tf

Initial_X = np.random.rand(100, 5000, 2)
N_EXPANSION = 100

def process_your_input_after_expanded(x):
  # do whatever you'd like to your input
  return x

def create_new_samples(x):
  # build a small dataset of N_EXPANSION elements, all derived from the single sample x
  expanded_ds = tf.data.Dataset.range(0, N_EXPANSION)
  expanded_ds = expanded_ds.map(lambda _: process_your_input_after_expanded(x))
  return expanded_ds

def get_dataset():
  # stream the initial samples one by one and expand each of them on the fly
  dataset = tf.data.Dataset.from_tensor_slices(tensors=Initial_X)
  dataset = dataset.interleave(lambda x: create_new_samples(x))
  dataset = dataset.batch(1)
  return dataset

train_dataset = get_dataset()

for (i, ex) in enumerate(train_dataset):
    print(i, ex.shape)

Basically it expands your input (N = 100) by a factor of N_EXPANSION (100), iteratively turning each sample into N_EXPANSION new samples. You end up with 100 * 100 = 10000 samples (with a batch size of 1).

You can just copy that into a Colab notebook and it should work out of the box.
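
If you also need labels, as in your x, y pairs, the same pattern works with tuple-valued datasets. A hypothetical variant (create_new_labeled_samples and the index-as-label choice are just placeholders for whatever your real F produces) could look like this:

def create_new_labeled_samples(x):
  # build (sample, label) pairs; the expansion index stands in for a real label
  expanded_ds = tf.data.Dataset.range(0, N_EXPANSION)
  return expanded_ds.map(
      lambda i: (process_your_input_after_expanded(x), tf.cast(i, tf.float64)))

def get_labeled_dataset():
  dataset = tf.data.Dataset.from_tensor_slices(tensors=Initial_X)
  dataset = dataset.interleave(create_new_labeled_samples)
  return dataset.batch(32)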

Best of luck!

Tiago Santos