
I have a lot of CSV files, each record containing ~6000 columns. The first column is the label and the remaining columns should be treated as a feature vector. I'm new to TensorFlow and I can't figure out how to read the data into a TensorFlow Dataset in the desired format. I currently have the following code running:

import tensorflow as tf

n_features = 6170  # ~6000 feature columns, plus one label column
DEFAULTS = [[0.0] for _ in range(n_features + 1)]

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # parse one line at a time
    features = {'label': columns[0], 'x': tf.stack(columns[1:])}  # first column is the label
    labels = features.pop('label')  # separate the label from the features

    return features, labels


def train_input_fn(data_file=sample_csv_file, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

Each CSV file has ~10K records. I've tried a sample evaluation of train_input_fn as labels = train_input_fn()[1].eval(session=sess). This fetches 128 labels, but it takes around 2 minutes.
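For reference, this is roughly the timing check I mean (a minimal sketch; the session setup is my addition, and train_input_fn is the function defined above):

import time
import tensorflow as tf

with tf.Session() as sess:
    _, labels_op = train_input_fn()        # tensors for one batch
    start = time.time()
    labels = labels_op.eval(session=sess)  # materializes one batch of 128 labels
    print(len(labels), 'labels in %.1f s' % (time.time() - start))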

Am I using redundant operations, or is there a better way to do this?

PS: I have the original data in a Spark DataFrame, so I can use TFRecords as well if that would make things faster.
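For example (an untested assumption on my part): with the spark-tensorflow-connector JAR on the classpath, the DataFrame could be written out directly; the "tfrecords" format string and "recordType" option come from that connector, and the paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", inferSchema=True)  # hypothetical input path
# "tfrecords" is the output format registered by spark-tensorflow-connector
df.write.format("tfrecords").option("recordType", "Example").save("csv.tfrecords")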

  • Let me know if my answer addressed your problem. Thanks. – Vijay Mariappan May 24 '18 at 17:44
  • You can even create the .tfrecord file yourself (if you don't have one), as shown [here](https://stackoverflow.com/questions/41402332/tensorflow-create-a-tfrecords-file-from-csv). But for speed, testing is needed: "[There](https://www.tensorflow.org/tutorials/load_data/tfrecord) is no need to convert existing code to use TFRecords, unless you are using _tf.data_ and reading data is still the bottleneck to training." _tf.data_ is reported to really simplify dealing with collections of files; otherwise _from_tensor_slices(dict(df))_ is enough for CSV data. TFRecord is also used for encoded image data. – JeeyCi Apr 28 '22 at 13:36

1 Answer


You are doing it right, but a faster way is to use TFRecords, as shown in the following steps:

  1. Use tf.python_io.TFRecordWriter to read the CSV file and write it out as a TFRecord file, as shown here: [Tensorflow create a tfrecords file from csv](https://stackoverflow.com/questions/41402332/tensorflow-create-a-tfrecords-file-from-csv) (a short writer sketch follows after step 2).

  2. Read from the TFRecord file:

    def _parse_function(proto):
        # Both the feature vector and the label are stored as float lists.
        f = {
            "features": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True),
            "label": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True)
        }
        parsed_features = tf.parse_single_example(proto, f)
        features = parsed_features["features"]
        label = parsed_features["label"]
        return features, label


    dataset = tf.data.TFRecordDataset(['csv.tfrecords'])
    dataset = dataset.map(_parse_function)
    dataset = dataset.shuffle(10000).repeat().batch(128)
    iterator = dataset.make_one_shot_iterator()
    features, label = iterator.get_next()
    
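As a sketch of step 1 (assuming each CSV row is the label followed by the float features; the helper name write_tfrecords is mine, not from the linked post, and the feature keys match the reader above):

    import csv
    import tensorflow as tf

    def write_tfrecords(csv_path, tfrecord_path):
        # Column 0 is the label; the remaining columns are the feature values.
        with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
            with open(csv_path) as f:
                for row in csv.reader(f):
                    values = [float(v) for v in row]
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "label": tf.train.Feature(float_list=tf.train.FloatList(value=[values[0]])),
                        "features": tf.train.Feature(float_list=tf.train.FloatList(value=values[1:])),
                    }))
                    writer.write(example.SerializeToString())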

I ran both cases (CSV vs. TFRecords) on a randomly generated CSV file. The total time for 10 batches (128 samples each) was around 204 s for the direct CSV read, versus around 0.22 s for TFRecords.

  • I stumbled upon the link you provided in your answer just an hour after I posted the question; hence I hadn't responded to your answer. Yes, I think your answer does answer the question. – Sai Kiriti Badam May 25 '18 at 08:16