
I have a lot of CSV files, each record containing ~6000 columns. The first column is the label and the remaining columns should be treated as a feature vector. I'm new to TensorFlow and I can't figure out how to read the data into a TensorFlow Dataset in the desired format. I currently have the following code running:

import tensorflow as tf

n_features = 6170  # ~6000 feature columns, plus one label column
DEFAULTS = [[0.0] for _ in range(n_features + 1)]

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # parse one line at a time
    features = {'label': columns[0], 'x': tf.stack(columns[1:])}  # first column is the label
    labels = features.pop('label')  # separate the label from the features

    return features, labels


def train_input_fn(data_file=sample_csv_file, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

Each CSV file has ~10K records. I've tried a sample evaluation of train_input_fn as labels = train_input_fn()[1].eval(session=sess). This fetches 128 labels, but it takes around 2 minutes.
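For reference, this is roughly the timing check I mean (a minimal sketch; the session setup is my addition, and train_input_fn is the function defined above):

import time
import tensorflow as tf

with tf.Session() as sess:
    _, labels_op = train_input_fn()        # tensors for one batch
    start = time.time()
    labels = labels_op.eval(session=sess)  # materializes one batch of 128 labels
    print(len(labels), 'labels in %.1f s' % (time.time() - start))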

Am I using redundant operations, or is there a better way to do this?

PS: I have the original data in a Spark DataFrame, so I can use TFRecords as well if that would make things faster.
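For example (an untested assumption on my part): with the spark-tensorflow-connector JAR on the classpath, the DataFrame could be written out directly; the "tfrecords" format string and "recordType" option come from that connector, and the paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", inferSchema=True)  # hypothetical input path
# "tfrecords" is the output format registered by spark-tensorflow-connector
df.write.format("tfrecords").option("recordType", "Example").save("csv.tfrecords")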

  • Let me know if my answer addressed your problem. Thanks. – Vijay Mariappan May 24 '18 at 17:44
  • You can even create the .tfrecord file yourself (if you don't have one), as shown [here](https://stackoverflow.com/questions/41402332/tensorflow-create-a-tfrecords-file-from-csv). But for speed, testing is needed: "[There](https://www.tensorflow.org/tutorials/load_data/tfrecord) is no need to convert existing code to use TFRecords, unless you are using _tf.data_ and reading data is still the bottleneck to training." _tf.data_ is reported to really simplify dealing with collections of files; otherwise _from_tensor_slices(dict(df))_ is enough for CSV data. TFRecord is also used for encoded image data. – JeeyCi Apr 28 '22 at 13:36

1 Answer


You are doing it right, but a faster way is to use TFRecords, as shown in the following steps:

  1. Use tf.python_io.TFRecordWriter to read the CSV file and write it out as a TFRecord file, as shown here: [Tensorflow create a tfrecords file from csv](https://stackoverflow.com/questions/41402332/tensorflow-create-a-tfrecords-file-from-csv) (a short writer sketch follows after step 2).

  2. Read from the TFRecord file:

    def _parse_function(proto):
        # Both the feature vector and the label are stored as float lists.
        f = {
            "features": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True),
            "label": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True)
        }
        parsed_features = tf.parse_single_example(proto, f)
        features = parsed_features["features"]
        label = parsed_features["label"]
        return features, label


    dataset = tf.data.TFRecordDataset(['csv.tfrecords'])
    dataset = dataset.map(_parse_function)
    dataset = dataset.shuffle(10000).repeat().batch(128)
    iterator = dataset.make_one_shot_iterator()
    features, label = iterator.get_next()
    
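As a sketch of step 1 (assuming each CSV row is the label followed by the float features; the helper name write_tfrecords is mine, not from the linked post, and the feature keys match the reader above):

    import csv
    import tensorflow as tf

    def write_tfrecords(csv_path, tfrecord_path):
        # Column 0 is the label; the remaining columns are the feature values.
        with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
            with open(csv_path) as f:
                for row in csv.reader(f):
                    values = [float(v) for v in row]
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "label": tf.train.Feature(float_list=tf.train.FloatList(value=[values[0]])),
                        "features": tf.train.Feature(float_list=tf.train.FloatList(value=values[1:])),
                    }))
                    writer.write(example.SerializeToString())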

I ran both cases (CSV vs. TFRecords) on a randomly generated CSV file. The total time for 10 batches (128 samples each) was around 204 s for the direct CSV read, versus around 0.22 s for TFRecords.

  • I stumbled upon the link you provided in your answer just an hour after I posted the question; hence I hadn't responded to your answer. Yes, I think your answer does answer the question. – Sai Kiriti Badam May 25 '18 at 08:16