My data set consists of hundreds of .csv files with a fixed number of columns and a variable number of rows. How can I read them into TensorFlow?

filename_queue = tf.train.string_input_producer(['file1.csv','file2.csv'])
features_reader = tf.WholeFileReader()
filename, value = features_reader.read(filename_queue)

Now it would be great to have some method to decode value into the actual numbers it contains. Is there a way to do that, or should I use a different reader instead?

Karaszka

1 Answer

In the end I solved this with a different reader: I converted the data to tf.records, TensorFlow's binary record format, and I think this is generally the way to go in such a case.

While the official documentation on handling tf.records is not satisfying, there is a great explanation here: http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf.

First one needs to read the file and convert it to binary format. In my case I just read the file to a numpy array.

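# your_custom_reader is a placeholder for any function that returns a NumPy array
file = your_custom_reader(csv_file)
# serialize the array to raw bytes (the shape is not stored in them)
file = file.tobytes()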
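
As a minimal sketch of this conversion step, assuming the CSV files hold only numeric values and have no header row, `your_custom_reader` could be little more than a call to `np.loadtxt`. The function body and the in-memory sample data are illustrative assumptions, not necessarily the reader used for the original data. The round trip at the end shows that `tobytes()` keeps the values but drops the shape.

```python
import io
import numpy as np

def your_custom_reader(csv_file):
    # Hypothetical reader: parse a numeric, header-less CSV into a 2-D float32 array.
    # float32 is chosen to match the tf.float32 used when decoding later on.
    return np.loadtxt(csv_file, delimiter=",", dtype=np.float32, ndmin=2)

# An in-memory stand-in for one CSV file with 2 rows and 3 columns.
csv_file = io.StringIO("1,2,3\n4,5,6\n")

array = your_custom_reader(csv_file)
raw = array.tobytes()  # 2 * 3 values, 4 bytes each

# Reading the bytes back yields a flat vector; the (rows, columns) shape is gone.
flat = np.frombuffer(raw, dtype=np.float32)
restored = flat.reshape(-1, 3)  # only possible because the column count is known
```

Because the column count is fixed, the row count can be recovered from the length of the flat vector here, but only once the values are actually available, not while the graph is being built.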

Now, in my case the number of columns was constant, but the number of rows varied across the data set. This can be tricky: when you read the binaries back in, they come as tensors with no predefined shape. (In the example from the notes the shape is stored in the binary, but this still means that you need to evaluate it in the session, which makes it useless for constructing the model.) Therefore at this step it's useful to pad your arrays to the maximal size.

file = your_custom_reader(csv_file)
file = pad_to_max_size(file)
file = file.tobytes()
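
The padding helper is not shown above, so here is one possible sketch of `pad_to_max_size`, assuming zero padding along the row axis is acceptable for the model. `MAX_ROWS` is a hypothetical constant standing for the largest row count found across all files; it would have to be computed in a separate pass over the data.

```python
import numpy as np

MAX_ROWS = 5  # hypothetical: the largest number of rows in any file

def pad_to_max_size(array, max_rows=MAX_ROWS):
    """Zero-pad a (rows, columns) array along axis 0 up to max_rows rows."""
    rows = array.shape[0]
    if rows > max_rows:
        raise ValueError("array has %d rows, more than max_rows=%d" % (rows, max_rows))
    # Pad only at the bottom of axis 0; leave the column axis untouched.
    return np.pad(array, ((0, max_rows - rows), (0, 0)), mode="constant")

short = np.ones((2, 3), dtype=np.float32)  # stand-in for a file with 2 rows
padded = pad_to_max_size(short)
raw = padded.tobytes()  # every file now serializes to the same number of bytes
```

With this, every record decodes to the same fixed shape. If the model needs to tell real rows from padding, the original row count would have to be stored as an extra feature.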

Writing to a tf.record is easy. Given that for each file you have a label y:

writer = tf.python_io.TFRecordWriter(file_name)
example = tf.train.Example(features=tf.train.Features(feature={
    'features': tf.train.Feature(bytes_list=tf.train.BytesList(value=[file])),
    'y'       : tf.train.Feature(bytes_list=tf.train.BytesList(value=[y.tobytes()]))
    }))
writer.write(example.SerializeToString())
writer.close()

Now the binary can be loaded as follows:

tfrecord_file_queue = tf.train.string_input_producer([file_name, file_name_2,...,file_name_N], name='queue')
reader = tf.TFRecordReader()
_, tfrecord_serialized = reader.read(tfrecord_file_queue)
tfrecord_features = tf.parse_single_example(
    tfrecord_serialized,
    features={
        'features': tf.FixedLenFeature([], tf.string),
        'y': tf.FixedLenFeature([], tf.string),
    },
    name='tf_features')

As I said, for the rest of the code it's important to know the shape of your tensor. Mine was (SHAPE_1, SHAPE_2):

# reinterpret the raw bytes as float32 values (a flat tensor)
features = tf.decode_raw(tfrecord_features['features'], tf.float32)
# restore the known, fixed shape
features = tf.reshape(features, (SHAPE_1, SHAPE_2))
features.set_shape((SHAPE_1, SHAPE_2))
y = tf.decode_raw(tfrecord_features['y'], tf.float32)
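
One pitfall worth checking at this point: `tf.decode_raw` reinterprets the raw bytes with whatever dtype it is given, so the arrays must already be float32 when `tobytes()` is called at write time. The sketch below uses `np.frombuffer` purely as a NumPy stand-in for that reinterpretation (an assumption made so the example runs without a TensorFlow session), and the label value is made up.

```python
import numpy as np

y64 = np.array([1.0])                    # NumPy defaults to float64: 8 bytes per value
y32 = np.array([1.0], dtype=np.float32)  # 4 bytes per value

# Stand-in for tf.decode_raw(..., tf.float32): reinterpret the bytes as float32.
wrong = np.frombuffer(y64.tobytes(), dtype=np.float32)
right = np.frombuffer(y32.tobytes(), dtype=np.float32)

# The float64 bytes decode to twice as many values, and they are not the original ones.
print(wrong.size, right.size)
```

If the decoded tensor comes back with an unexpected length, a dtype mismatch between writing and reading is a likely cause.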

A more organized example that puts the code into functions is available in the Stanford lecture notes I linked above. I recommend them highly, especially since they provide more explanation where this answer is lacking. Still, I hope it helps!

Karaszka