I have been searching for quite some time for how to go about this and can't seem to find anything that works.
I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in that tutorial (i.e. I have 3 directories containing all the training/validation/test files); however, my files are not images, they're spectrograms saved as CSVs.
I have found a couple of solutions for reading a CSV where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?), but my issue with that approach is the required record_defaults parameter, since my CSVs are 500x200 and each whole file (not each line) is one training instance.
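If I understand tf.decode_csv correctly, I would have to supply one default per column and would still only get one row of one spectrogram per call, something like this (untested sketch just to show what I mean):

import tensorflow as tf

# One default value per column -- 200 of them for my 500x200 files
record_defaults = [[0.0]] * 200

# decode_csv parses a single CSV line, i.e. one row of one spectrogram,
# so I'd still need to stitch 500 of these together per training instance
example_line = tf.constant(",".join(["1.0"] * 200))
columns = tf.decode_csv(example_line, record_defaults=record_defaults)
row = tf.stack(columns)  # shape (200,)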
Here is what I was thinking:
import tensorflow as tf
import pandas as pd
def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label
X_train = tf.constant(training_files)   # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file
train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)
iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)
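The only way I can see to bridge the gap is to replace my load_data above with something that wraps the pandas call in tf.py_func, so that map can run ordinary Python on each file path. Here is a rough, untested sketch of what I mean (_load_csv is just a name I made up, and I've moved map before batch since pd.read_csv only handles one path at a time):

import numpy as np

def _load_csv(path, label):
    # Runs as plain Python, so path arrives as bytes rather than a Tensor
    data = pd.read_csv(path.decode(), index_col=0).values.astype(np.float32)
    return data, label

def load_data(path, label):
    # tf.py_func lets Dataset.map call the Python reader on each file path
    data, label = tf.py_func(_load_csv, [path, label], [tf.float32, label.dtype])
    data.set_shape([500, 200])  # py_func drops static shape information
    return data, label

# map per file (pd.read_csv only handles one path), then batch the parsed arrays
train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
train_data = train_data.map(load_data).batch(64)

But I don't know whether py_func is the right tool here, or whether there is a cleaner way using the native tf.data CSV ops.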
I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.
Any thoughts? Thanks.