
This question has been answered for TensorFlow 1 (e.g. How to Properly Combine TensorFlow's Dataset API and Keras?), but that answer hasn't helped for my use case.

Below is an example of a model with three float32 inputs and one float32 output. I have a large amount of data that doesn't all fit into memory at once, so it's split across separate files. I'm trying to use the Dataset API to train the model by bringing in a portion of the training data at a time.

import tensorflow as tf
import tensorflow.keras.layers as layers
import numpy as np

# Create TF model of a given architecture (number of hidden layers, layer size, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model

# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list of 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
validation_dataset = np.random.rand(30,4).astype(np.float32)

def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))

# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())

# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,(np.float32,np.float32))

# fit model
model.fit(dataset, epochs=100, validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))

Running this, I get the error:

ValueError: Cannot take the length of shape with unknown rank.

Does anyone know how to get this working? I'd also like to be able to use the batch dimension, for example to load two data files at a time.

maurera

2 Answers


You need to specify the shapes of your dataset along with the return data types; without output_shapes, from_generator produces tensors of unknown rank, which is what causes the ValueError above.

dataset = tf.data.Dataset.from_generator(data_generator,
                                         (np.float32,np.float32),
                                         ((None, 3), (None, 1)))
Srihari Humbarwadi
  • This results in the error "ValueError: `batch_size` or `steps` is required for `Tensor` or `NumPy` input data." Then if I specify either batch_size=1 or steps=1 I get subsequent errors. If I specify batch_size=1 I get the error "ValueError: The `batch_size` argument must not be specified for the given input type." If I specify steps=1 I get the error "TypeError: Unrecognized keyword arguments: {'steps': 1}" – maurera Mar 16 '20 at 21:29

The following works, but I don't know if it's the most efficient approach.

As far as I understand, if your training dataset is split into 10 pieces, you should set steps_per_epoch=10, which ensures that each epoch steps through all of the data once. The .repeat() is needed because the dataset iterator is "used up" after the first epoch; .repeat() ensures the iterator gets recreated so subsequent epochs can run.

import numpy as np
import tensorflow.keras.layers as layers
import tensorflow as tf

# Create TF model of a given architecture (number of hidden layers, layer size, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model

# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list of 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
steps_per_epoch = len(list_of_training_datasets)
validation_dataset = np.random.rand(30,4).astype(np.float32)

def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))

# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())

# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,output_types=(np.float32,np.float32),
                output_shapes=(tf.TensorShape([None,3]), tf.TensorShape([None,1]))).repeat()

# fit model
model.fit(dataset.as_numpy_iterator(), epochs=10,steps_per_epoch=steps_per_epoch,
          validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))
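
To address the follow-up in my question (using the batch dimension to train on two files' worth of data at a time), one option is to flatten the per-file arrays into individual rows and then re-batch to whatever size you want. This is just a sketch under the assumptions of this example (5 files of 10 rows each); I haven't tested whether it's more efficient:

# Flatten each per-file (x, y) pair into individual rows, then re-batch
# so that each training step sees 20 rows (two files' worth, since each
# example file above has 10 rows)
dataset = tf.data.Dataset.from_generator(data_generator, output_types=(np.float32,np.float32),
                output_shapes=(tf.TensorShape([None,3]), tf.TensorShape([None,1])))
dataset = dataset.unbatch().batch(20).repeat()

Note that steps_per_epoch then needs to be computed from the total number of rows and the batch size (here 50 rows / 20 rows per batch), rather than from the number of files.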
maurera