
I am currently trying to get a simple TensorFlow model to train on data provided by a custom input pipeline. It should work as efficiently as possible. Although I've read lots of tutorials, I can't get it to work.

THE DATA

I have my training data split across several CSV files. File 'a.csv' has 20 samples and 'b.csv' has 30. They have the same structure with the same header:

feature1; feature2; feature3; feature4
0.1; 0.2; 0.3; 0.4
...

(No labels, as it is for an autoencoder.)

THE CODE

I have written an input pipeline and would like to feed the data from it to the model. My code looks like this:

import tensorflow as tf

def input_pipeline(filenames, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices(filenames)

    dataset = dataset.flat_map(
        lambda filename: (
                tf.data.TextLineDataset(filename)
                 .skip(1)
                 .shuffle(10)
                 .map(lambda csv_row: tf.decode_csv(
                         csv_row, 
                         record_defaults=[[-1.0]]*4,
                         field_delim=';'))
                 .batch(batch_size)     
        )
    )

    return dataset.make_initializable_iterator()


iterator = input_pipeline(['/home/sku/data/a.csv', 
                           '/home/sku/data/b.csv'], 
                           batch_size=5)

next_element = iterator.get_next()


# Build the autoencoder
x = tf.placeholder(tf.float32, shape=[None, 4], name='in')

z = tf.contrib.layers.fully_connected(x, 2, activation_fn=tf.nn.relu)

x_hat = tf.contrib.layers.fully_connected(z, 4)

# loss function with epsilon for numeric stability
epsilon = 1e-10
loss = -tf.reduce_sum(
    x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat))

train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.global_variables_initializer())

    for i in range(50):
        batch = sess.run(next_element)
        sess.run(train_op, feed_dict={x: batch, x_hat: batch})

THE PROBLEM

When trying to feed the data to the model, I get an error:

ValueError: Cannot feed value of shape (4, 5) for Tensor 'in:0', which has shape '(?, 4)'

When printing out the shapes of the batched data, I get this for example:

(array([ 4.1,  5.9,  5.5,  6.7, 10. ], dtype=float32), array([0.4, 7.7, 0. , 3.4, 8.7], dtype=float32), array([3.5, 4.9, 8.3, 7.2, 6.4], dtype=float32), array([-1. , -1. ,  9.6, -1. , -1. ], dtype=float32))

It makes sense, but where and how do I have to reshape this? Also, the additional dtype info only appears with batching.
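For illustration (values copied from the printout above), here is a small NumPy sketch of why TensorFlow ends up seeing this tuple as shape (4, 5) instead of the (?, 4) the placeholder expects:

import numpy as np

# decode_csv yields one tensor per column, so a batch of 5 rows arrives
# as a tuple of 4 per-column arrays of length 5 (values from the print above).
batch = (np.array([4.1, 5.9, 5.5, 6.7, 10.0], dtype=np.float32),
         np.array([0.4, 7.7, 0.0, 3.4, 8.7], dtype=np.float32),
         np.array([3.5, 4.9, 8.3, 7.2, 6.4], dtype=np.float32),
         np.array([-1.0, -1.0, 9.6, -1.0, -1.0], dtype=np.float32))

print(np.asarray(batch).shape)        # (4, 5) -- columns x rows, rejected by 'in:0'
print(np.stack(batch, axis=1).shape)  # (5, 4) -- rows x columns, what 'in:0' expects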

I also considered that I did the feeding wrong. Do I need input_fn or something like that? I remember reading that feeding with dicts is way too slow. If somebody could give me an efficient way to prepare and feed the data, I would be really grateful.

Regards,

DocDriven

1 Answer


I've figured out a solution that requires a second map transformation. You have to add the following line to the input function:

def input_pipeline(filenames, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices(filenames)

    dataset = dataset.flat_map(
        lambda filename: (
                tf.data.TextLineDataset(filename)
                 .skip(1)
                 .shuffle(10)
                 .map(lambda csv_row: tf.decode_csv(
                         csv_row, 
                         record_defaults=[[-1.0]]*4,
                         field_delim=';'))
                 .map(lambda *inputs: tf.stack(inputs))  # <-- mapping required
                 .batch(batch_size)     
        )
    )

    return dataset.make_initializable_iterator()

This stacks the per-column tensors into a single matrix, which can then be fed to the network.
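As a quick sanity check (not part of the pipeline itself, just an illustrative snippet), each batched element now comes back as a single (batch_size, 4) array:

with tf.Session() as sess:
    sess.run(iterator.initializer)
    batch = sess.run(next_element)
    print(batch.shape)  # e.g. (5, 4): batch_size x 4 features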

However, I'm still not sure whether feeding it via feed_dict is the most efficient way. I'd appreciate support here!
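One possible direction (a sketch of the usual TF 1.x tf.data pattern, not benchmarked here): wire the iterator output straight into the model instead of round-tripping each batch through Python and feed_dict:

# Sketch: build the autoencoder directly on the iterator output,
# so no placeholder and no feed_dict are needed.
iterator = input_pipeline(['/home/sku/data/a.csv',
                           '/home/sku/data/b.csv'],
                          batch_size=5)
x = iterator.get_next()  # shape (?, 4) after the tf.stack map + batch

z = tf.contrib.layers.fully_connected(x, 2, activation_fn=tf.nn.relu)
x_hat = tf.contrib.layers.fully_connected(z, 4)

epsilon = 1e-10
loss = -tf.reduce_sum(
    x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat))
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.global_variables_initializer())
    # Note: with 50 samples and batch_size=5 the iterator is exhausted after
    # 10 steps; add .repeat() to the dataset (or catch tf.errors.OutOfRangeError)
    # to run more iterations.
    for i in range(50):
        sess.run(train_op)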
