
The documentation for CSV Datasets stops short of showing how to use a CSV dataset for anything practical, like using the data to train a neural network. Can anyone provide a straightforward example of how to do this, with clarity around data shape and type issues at a minimum, and preferably covering batching, shuffling, and repeating over epochs as well?

For example, I have a CSV file of M rows, each row being an integer class label followed by N integers from which I hope to predict the class label using an old-style 3-layer neural network with H hidden neurons:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N))
...
model.fit(train_ds, ...)

For my data, M > 50000 and N > 200. I have tried creating my dataset by using:

train_ds = tf.data.experimental.make_csv_dataset('mydata.csv', batch_size=B)

However, this leads to compatibility problems between the dataset and the model, and it's not clear where those problems lie: in the input shape, in the integer (rather than float) data, or somewhere else?
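For reference, a minimal diagnostic sketch of what make_csv_dataset yields (the element structure is taken from the TF docs; the batch size is a placeholder, not my actual value):

import tensorflow as tf

B = 32          # placeholder batch size
train_ds = tf.data.experimental.make_csv_dataset('mydata.csv', batch_size=B)

# Without label_name, each element is an OrderedDict mapping column name to a
# (batch_size,) tensor - not the single (batch_size, N) float tensor Dense expects.
for features in train_ds.take(1):
    for name, column in features.items():
        print(name, column.dtype, column.shape)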

omatai
  • Add first few rows of the dataset. – Zabir Al Nazi May 05 '20 at 07:19
  • First row: "Label,X1,X2,X3,X4,....,X205". Subsequent rows: "<k>,<x1>,<x2>,<x3>,..." where 0 <= k <= K for K classes, each <xi> is any (random) int, and there are 205 of them after the label. – omatai May 05 '20 at 09:41
  • Please don't just describe the format; post the actual file contents in the post as text. – Zabir Al Nazi May 05 '20 at 09:46
  • Data is commercially sensitive; question is not specific to data. Pick 205 random integers. You know you can do it :-) – omatai May 06 '20 at 04:56
  • There is still no workable example anywhere I can find, 2 years later. – John Glen Mar 17 '22 at 00:44
  • So results may vary, but there is an example here https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/csv.ipynb?hl=de#scrollTo=M0iGXv9pC5kr. It does not work on my Fedora machine but others may have some luck. – John Glen Mar 23 '22 at 00:10

1 Answer


This question may provide some help, although the answers mostly relate to TensorFlow v1.x.

It is likely that CSV Datasets are not required for this task. Data of the size indicated will probably fit in memory, and a tf.data.Dataset may add more complexity than useful functionality. You can do it without datasets (as shown below), so long as ALL the data is integers.

If you persist with the CSV Dataset approach, understand that CSVs are used in many ways and there are different approaches to loading them (e.g. see here and here). Because CSVs can contain a variety of column types (numerical, boolean, text, categorical, ...), the first step is usually to load the CSV data in a column-oriented format. This gives access to the columns via their labels, which is useful for pre-processing. However, you probably want to feed rows of data to your model, so translating from columns to rows may be one source of confusion. At some point you will probably also need to convert your integer data to float, though this may happen as a side-effect of certain pre-processing. A minimal sketch of such a pipeline follows.
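If you do stick with make_csv_dataset, something along these lines may help. This is a sketch only: the label column name 'Label' is taken from the comments above, and the batch size, shuffling and epoch settings are placeholder choices, not tested against the actual file.

import tensorflow as tf

train_ds = tf.data.experimental.make_csv_dataset(
    'mydata.csv',
    batch_size=32,
    label_name='Label',       # column holding the class label
    num_epochs=1,             # one pass per fit() epoch; the default repeats forever
    shuffle=True)

# make_csv_dataset yields (dict_of_columns, label) batches, each column being a
# (batch_size,) tensor. A Dense layer wants one (batch_size, N) float tensor,
# so stack the columns and cast the integers to float.
def pack_features(features, label):
    x = tf.stack([tf.cast(v, tf.float32) for v in features.values()], axis=-1)
    return x, label

train_ds = train_ds.map(pack_features)
# model.fit(train_ds, epochs=...) should then receive batches shaped (32, N).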

So long as your CSVs contain only integers, with no missing data and a header row, you can do it without a tf.data.Dataset, step by step, as follows:

import numpy as np
from numpy import genfromtxt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

train_data = genfromtxt('train set.csv', delimiter=',')
test_data = genfromtxt('test set.csv', delimiter=',')
train_data = np.delete(train_data, (0), axis=0)    # delete header row
test_data = np.delete(test_data, (0), axis=0)      # delete header row
train_labels = train_data[:,[0]]
test_labels = test_data[:,[0]]
train_labels = tf.keras.utils.to_categorical(train_labels)
# count labels used in the training set; categorise the test set on the same basis,
# even if the test set only uses a subset of the categories learned in training
K = len(train_labels[0])                           # number of classes
test_labels = tf.keras.utils.to_categorical(test_labels, K)
train_data = np.delete(train_data, (0), axis=1)    # delete label column
test_data = np.delete(test_data, (0), axis=1)      # delete label column
# Data will have been read in as float... but you may want scaling/normalization...
scale = lambda x: x / 1000.0 - 500.0               # change to suit
train_data = scale(train_data)                     # assign the result: scale() does not modify in place
test_data = scale(test_data)

N_train = len(train_data[0])        # columns in training set
N_test = len(test_data[0])          # columns in test set
if N_train != N_test:
  print("Datasets have incompatible column counts: %d vs %d" % (N_train, N_test))
  exit()
M_train = len(train_data)           # rows in training set
M_test = len(test_data)             # rows in test set

print("Training data size: %d rows x %d columns" % (M_train, N_train))
print("Test set data size: %d rows x %d columns" % (M_test, N_test))
print("Training to predict %d classes" % (K))

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N_train))     # H not yet defined...
...
model.compile(...)
model.fit( train_data, train_labels, ... )    # see docs for shuffle, batch, etc
model.evaluate( test_data, test_labels )
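
Since the question also asked about batching, shuffling and repeating over epochs: with in-memory arrays those are just arguments to fit(). A minimal sketch, where H, the batch size and the epoch count are placeholders to change to suit:

H = 64                                             # hidden neurons - placeholder
model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N_train))
model.add(Dense(K, activation='softmax'))          # one output per class
model.compile(optimizer='adam',
              loss='categorical_crossentropy',     # labels are one-hot from to_categorical
              metrics=['accuracy'])
model.fit(train_data, train_labels,
          batch_size=32,                           # batching
          epochs=10,                               # repeat over epochs
          shuffle=True)                            # shuffle each epoch
model.evaluate(test_data, test_labels)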
omatai
  • How does train_data take boundaries into account? – John Glen Mar 17 '22 at 00:45
  • How could it take boundaries into account? train_data is not a function; it's an array of data. Admittedly this answer does not directly address the question, but it addresses the question that motivated the title question. That is: had the question been "how can I use a sledgehammer to crack this particular nut?", this answer says "you can actually crack that nut with a different tool". Please don't downvote because you are disappointed with the failure to provide documentation on sledgehammers... – omatai Mar 22 '22 at 20:54
  • Sorry, I meant: how does the model that acts on train_data take boundaries into account? I missed that train_data is a DataFrame with columns. I will take back the downvote (you have to edit); in my frustration I downvoted even though, as you said, "you can actually crack that nut with a different tool" - advice I should have listened to, because make_csv_dataset was too experimental to work on my machine and I did end up using your approach. The data fit into memory, and even if it didn't, it would have been trivial to load in batches. – John Glen Mar 23 '22 at 00:06