I have a train_x.csv and a train_y.csv, and I'd like to train a model using the Dataset API and the Keras interface. This is what I'm trying to do:

import numpy as np
import pandas as pd
import tensorflow as tf

tf.enable_eager_execution()

N_FEATURES = 10
N_SAMPLES = 100
N_OUTPUTS = 2
BATCH_SIZE = 8
EPOCHS = 5

# prepare fake data
train_x = pd.DataFrame(np.random.rand(N_SAMPLES, N_FEATURES))
train_x.to_csv('train_x.csv', index=False)
train_y = pd.DataFrame(np.random.rand(N_SAMPLES, N_OUTPUTS))
train_y.to_csv('train_y.csv', index=False)

train_x = tf.data.experimental.CsvDataset('train_x.csv', [tf.float32] * N_FEATURES, header=True)
train_y = tf.data.experimental.CsvDataset('train_y.csv', [tf.float32] * N_OUTPUTS, header=True)
dataset = ...  # What to do here?

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(N_OUTPUTS, input_shape=(N_FEATURES,)),
    tf.keras.layers.Activation('linear'),
])
model.compile('sgd', 'mse')
model.fit(dataset, steps_per_epoch=N_SAMPLES/BATCH_SIZE, epochs=EPOCHS)

What's the right way to implement this dataset?

I tried the Dataset.zip API, i.e. dataset = tf.data.Dataset.zip((train_x, train_y)), but it doesn't seem to work (code here and error here). I also read this answer; it works, but I'd prefer a way that doesn't require declaring the model with the functional API.

Icyblade

1 Answer

The problem is in the input shape of your dense layer. It should match the shape of your input tensor, which is 1: tf.keras.layers.Dense(N_OUTPUTS, input_shape=(features_shape,))

Also, you might run into problems with the steps_per_epoch parameter of model.fit(): it should be an int, so convert it explicitly: model.fit(dataset, steps_per_epoch=int(N_SAMPLES/BATCH_SIZE), epochs=EPOCHS)
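To illustrate the point about steps_per_epoch, here is a small sketch of the arithmetic using the constants from the question. The variable names (as_float, full_batches, all_batches) are made up for this example. Which rounding to use is a judgment call: int(...) and floor division drop the final partial batch, while rounding up counts it.

```python
import math

N_SAMPLES = 100
BATCH_SIZE = 8

# True division always produces a float in Python 3, even for whole results.
as_float = N_SAMPLES / BATCH_SIZE

# Floor division gives the number of complete batches (same as int(as_float)).
full_batches = N_SAMPLES // BATCH_SIZE

# Rounding up also counts the last, smaller batch.
all_batches = math.ceil(N_SAMPLES / BATCH_SIZE)

print(as_float, full_batches, all_batches)  # 12.5 12 13
```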

Edit 1: If you need multiple labels, you can do:

def parse_func(data, labels):
    # Stack the per-column label scalars into one vector of shape (N_OUTPUTS,).
    return data, tf.stack(labels, axis=0)

dataset = tf.data.Dataset.zip((train_x, train_y))
dataset = dataset.map(parse_func)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.repeat()
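The snippet above stacks only the labels. CsvDataset yields one scalar tensor per CSV column, so a possible variant is to stack the features the same way, which gives each example shape (N_FEATURES,) as the commenter expects. The following is a sketch rather than a confirmed fix: it assumes TensorFlow 2.x (eager execution on by default), writes the fake CSV files with NumPy to a temporary directory, and uses a hypothetical helper name, stack_columns.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

N_FEATURES, N_OUTPUTS, N_SAMPLES, BATCH_SIZE = 10, 2, 100, 8

# Write fake CSV files with a header row, mirroring the question's setup.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "train_x.csv")
y_path = os.path.join(tmp_dir, "train_y.csv")
np.savetxt(x_path, np.random.rand(N_SAMPLES, N_FEATURES), delimiter=",", fmt="%.6f",
           header=",".join(str(i) for i in range(N_FEATURES)), comments="")
np.savetxt(y_path, np.random.rand(N_SAMPLES, N_OUTPUTS), delimiter=",", fmt="%.6f",
           header=",".join(str(i) for i in range(N_OUTPUTS)), comments="")

train_x = tf.data.experimental.CsvDataset(x_path, [tf.float32] * N_FEATURES, header=True)
train_y = tf.data.experimental.CsvDataset(y_path, [tf.float32] * N_OUTPUTS, header=True)


def stack_columns(features, labels):
    # Each argument is a tuple of scalar tensors, one per CSV column.
    # Stacking turns them into vectors of shape (N_FEATURES,) and (N_OUTPUTS,).
    return tf.stack(features, axis=0), tf.stack(labels, axis=0)


dataset = (
    tf.data.Dataset.zip((train_x, train_y))
    .map(stack_columns)
    .batch(BATCH_SIZE)
    .repeat()
)

# Pull one batch to inspect the shapes the model would receive.
x_batch, y_batch = next(iter(dataset))
print(x_batch.shape, y_batch.shape)  # (8, 10) (8, 2)
```

With batches shaped like this, the Sequential model from the question, declared with input_shape=(N_FEATURES,), should receive inputs and targets of the expected rank.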
Sharky
  • But by design the input shape should be N_FEATURES not 1, as my train_x has shape (N_SAMPLES, N_FEATURES). – Icyblade Feb 24 '19 at 14:20
  • That's right, input shape is (batch, input), input is itself 1d array. Have you tried the code from my answer? – Sharky Feb 24 '19 at 15:29
  • You mean `input_shape=(1, )`? Not working I'm afraid. – Icyblade Feb 24 '19 at 15:41
  • Are you getting the following error `ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected...` ? If so, there is a shape mismatch in your random data, it should be `train_y = pd.DataFrame(np.random.rand(N_SAMPLES))` see this answer https://stackoverflow.com/questions/43899248/keras-model-valueerror-error-when-checking-model-target – Sharky Feb 24 '19 at 16:19
  • But my network should have two outputs, that's why the N_OUTPUTS exists. And the `train_y` cannot be a 1-dimensional array. – Icyblade Feb 25 '19 at 11:08
  • `pd.DataFrame(np.random.rand(N_SAMPLES))` will create a dataframe of shape `[N_SAMPLES rows x 1 columns]`, so one label per training example. Is your situation different? – Sharky Feb 25 '19 at 13:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189001/discussion-between-icyblade-and-sharky). – Icyblade Feb 25 '19 at 14:01