
I have a Keras CNN model ready that expects [None, 20, 20, 3] arrays as input (20 is the image size here). On the other side, I have a CSV with 1200 (20*20*3) columns ready in my cloud storage.

I want to write an ETL pipeline with TensorFlow that yields a [20, 20, 3] shaped tensor for each row of the CSV.

I've already spent days of work on this and feel confident that this approach might work out in the end.

My code so far:

import tensorflow as tf

BATCH_SIZE = 30

tf.enable_eager_execution()

X_csv_path = 'gs://my-bucket/dataX.csv'


X_dataset = tf.data.experimental.make_csv_dataset(X_csv_path, BATCH_SIZE, column_names=range(1200), header=False)
X_dataset = X_dataset.map(lambda x: tf.stack(list(x.values())))

iterator = X_dataset.make_one_shot_iterator()
image = iterator.get_next()

I would expect to get a [30, 1200] shaped tensor, but I still get 1200 tensors of shape [30] instead. My idea is to read every line into a [1200] shaped tensor and then reshape it to [20, 20, 3] to feed my model. Thanks for your time!

ChriSquared

1 Answer


tf.data.experimental.make_csv_dataset creates a dataset of OrderedDicts that map column names to tensors, which is why your map yields 1200 column tensors of shape [30] rather than a single [30, 1200] tensor.
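As a side note, the original pipeline could be rescued by stacking the dict values along the last axis; a minimal sketch, assuming the range(1200) column names from the question:

# stack the 1200 per-column [30] tensors side by side -> [30, 1200]
X_dataset = X_dataset.map(lambda x: tf.stack(list(x.values()), axis=-1))

For your task, though, I'd use tf.data.TextLineDataset.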

def parse(line):
    # split one CSV line into a 1-D tensor of its string values
    return tf.strings.split([line], sep=',').values

dataset = tf.data.TextLineDataset('sample.csv').map(parse).batch(BATCH_SIZE)
for i in dataset:
    print(i)

This will output a tensor of shape (BATCH_SIZE, row_length), where row_length is the number of values in a row of the CSV file. You can apply any additional preprocessing, depending on your task.
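For instance, a minimal sketch of the reshape step the question asks about, assuming each row holds exactly 1200 numeric values (parse_image is an illustrative name, not part of the original code):

def parse_image(line):
    # split the line, convert the 1200 strings to floats,
    # and reshape them into a single [20, 20, 3] image tensor
    values = tf.strings.split([line], sep=',').values
    numbers = tf.strings.to_number(values, out_type=tf.float32)
    return tf.reshape(numbers, [20, 20, 3])

dataset = tf.data.TextLineDataset(X_csv_path).map(parse_image).batch(BATCH_SIZE)
# each batch now has shape [BATCH_SIZE, 20, 20, 3], ready to feed the model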

Sharky
  • Thank you very much Sharky! It worked! :) Is my approach also something one would consider best practice? The idea behind it is that I don't want the images to be read in every time I train a model, so I prepared all the images and constructed a CSV file of the desired format as my new starting point. Now I want to train different models based on this CSV in JupyterLab on Google Cloud. Any suggestions? – ChriSquared Apr 29 '19 at 06:55
  • Basically, if you have raw images as a dir of numpy arrays, it's better to use `from_tensor_slices` (see the sketch after these comments). Or, in some cases, you can convert them to a single TFRecord file – Sharky Apr 29 '19 at 07:06
  • Okay thanks, good to know! Is it okay to use a notebook instance on Google Cloud ML Engine for training models this way, or would you recommend creating a Python application (task.py, model.py, ...) and using the gcloud ml-engine jobs submit training command from the terminal? – ChriSquared Apr 29 '19 at 07:32
  • Sorry, can't advise on this matter. I think it will heavily depend on the particular case – Sharky Apr 29 '19 at 09:55
  • Okay, thank you very much for your help and happy coding! – ChriSquared Apr 29 '19 at 18:50
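A minimal sketch of the `from_tensor_slices` approach mentioned in the comments, assuming the preprocessed images already sit in memory as a single NumPy array (the array and its shape here are illustrative):

import numpy as np
import tensorflow as tf

# hypothetical stack of already-preprocessed images: [num_images, 20, 20, 3]
images = np.random.rand(100, 20, 20, 3).astype(np.float32)

# build a dataset directly from the in-memory array and batch it
dataset = tf.data.Dataset.from_tensor_slices(images).batch(30)
# each batch has shape [30, 20, 20, 3] (the last one may be smaller)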