
I'm trying to create a model in TensorFlow that predicts the ideal item for a user by predicting a vector of numbers. I created a dataset in Spark and saved it as a TFRecord file using the Spark TensorFlow connector. Each row of the dataset has several hundred features and 20 labels. For easier manipulation, I gave every column a prefix, either 'feature_' or 'label_'. Now I'm trying to write an input function for TensorFlow, but I can't figure out how to parse the data. So far I have written this:

def dataset_input_fn():
    path = ['data.tfrecord']
    dataset = tf.data.TFRecordDataset(path)
    def parser(record):
        example = tf.train.Example()
        example.ParseFromString(record)

        # TODO: no idea what to do here
        # features = parsed["features"]
        # label = parsed["label"]

        # return features, label

    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(100)
    iterator = dataset.make_one_shot_iterator()

    features, labels = iterator.get_next()
    return features, labels

How can I split the Example into a feature set and a label set? I have tried to split the Example into two parts, but I can't find a way to access its contents. The only way I have managed to see them is by printing the Example, which gives me something like this:

features {
...
  feature {
    key: "feature_wishlist_hour"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "label_emb_1"
    value {
      float_list {
        value: 0.4
      }
    }
  }
  feature {
    key: "label_emb_2"
    value {
      float_list {
        value: 0.8
      }
    }
  }
...
}
Ondrej

1 Answer


Your parser function should mirror how you constructed the Example proto. In your case it should look something like this:

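import tensorflow as tf

# Decode one serialized Example proto (TF 1.x API; in TF 2.x these calls live under tf.io).
def parser(example_proto):
    # Declare the type and shape of every key stored in the record.
    keys_to_features = {
        'feature_wishlist_hour': tf.FixedLenFeature((), tf.int64),
        'label_emb_1': tf.FixedLenFeature((), tf.float32),
        'label_emb_2': tf.FixedLenFeature((), tf.float32),
    }
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    # Return (features, labels); the two labels are grouped into a tuple.
    return (parsed_features['feature_wishlist_hour'],
            (parsed_features['label_emb_1'], parsed_features['label_emb_2']))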
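
With several hundred columns, listing every returned key by hand gets unwieldy. Since the question says every column carries a 'feature_' or 'label_' prefix, one option is to split the parsed dictionary by prefix. This is a minimal sketch, not part of the original answer: split_by_prefix is a hypothetical helper name, and plain Python values stand in for the tensors a real parser would produce.

```python
def split_by_prefix(parsed, feature_prefix='feature_', label_prefix='label_'):
    """Split a dict of parsed columns into (features, labels) dicts by key prefix.

    Hypothetical helper: it only looks at the keys, so it works the same on a
    dict of tensors as on the plain values used below.
    """
    features = {k: v for k, v in parsed.items() if k.startswith(feature_prefix)}
    labels = {k: v for k, v in parsed.items() if k.startswith(label_prefix)}
    return features, labels


# Plain Python values stand in for the parsed tensors.
parsed = {'feature_wishlist_hour': 0, 'label_emb_1': 0.4, 'label_emb_2': 0.8}
features, labels = split_by_prefix(parsed)
print(sorted(features))
print(sorted(labels))
```

Inside the parser, the last line could then become return split_by_prefix(parsed_features), since Dataset.map accepts functions that return dictionaries of tensors.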

EDIT: From the comments, it seems you are encoding each feature as a separate key-value pair, which is not right. See this answer for how to write the records properly: "Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords?"
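
To illustrate the single-list layout, here is a rough sketch that packs all features into one float list and all labels into another. It assumes the writing side can be changed and that every column is numeric; make_example and the keys 'features' and 'labels' are names chosen for this illustration, not something taken from the linked answer.

```python
import tensorflow as tf


def make_example(features, labels):
    """Pack all features into one float list and all labels into another.

    Hypothetical helper; the keys 'features' and 'labels' are illustrative.
    """
    return tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(float_list=tf.train.FloatList(value=features)),
        'labels': tf.train.Feature(float_list=tf.train.FloatList(value=labels)),
    }))


# Serialize one row, then decode it again to check that the layout round-trips.
serialized = make_example([0.0, 1.5, 3.0], [0.4, 0.8]).SerializeToString()

decoded = tf.train.Example()
decoded.ParseFromString(serialized)
feature_values = list(decoded.features.feature['features'].float_list.value)
label_values = list(decoded.features.feature['labels'].float_list.value)
```

With this layout the parser needs only two entries, for example tf.FixedLenFeature([3], tf.float32) for 'features' and tf.FixedLenFeature([2], tf.float32) for 'labels', where the lengths match the data. String columns would not fit into a float list and would need their own entry.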

Vijay Mariappan
  • Thanks, this solved part of my problem (how to send multiple labels). The second problem is that I don't know how many columns the input has, and some of them are strings and some are floats. Do you have any idea how I can get a list of columns from the Example? I tried everything, but I could only get it to print in the format I posted; I can't get a list or iterator out of it. Thanks – Ondrej Jun 07 '18 at 07:04
  • OK, I think you are writing the Example proto wrong. Write all the features and labels for each input in a single list. See my answer here for how to do it: https://stackoverflow.com/questions/45427637/numpy-to-tfrecords-is-there-a-more-simple-way-to-handle-batch-inputs-from-tfrec/45428167#45428167 – Vijay Mariappan Jun 07 '18 at 08:24
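
On the question in the comments about listing the columns: a tf.train.Example decoded from raw bytes exposes its columns as a map, so the names and value types can be read directly. The sketch below is an assumption-laden illustration rather than part of the answer; describe_columns is a hypothetical helper, and the record is built by hand instead of being read from data.tfrecord.

```python
import tensorflow as tf


def describe_columns(serialized):
    """Return {column name: value kind} for one serialized tf.train.Example.

    Hypothetical helper. The kind is 'bytes_list', 'float_list' or
    'int64_list', or None if the feature holds no value at all.
    """
    example = tf.train.Example()
    example.ParseFromString(serialized)
    return {
        name: feature.WhichOneof('kind')
        for name, feature in example.features.feature.items()
    }


# Hand-built record for illustration, shaped like the printout in the question.
example = tf.train.Example(features=tf.train.Features(feature={
    'feature_wishlist_hour': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'label_emb_1': tf.train.Feature(float_list=tf.train.FloatList(value=[0.4])),
}))
columns = describe_columns(example.SerializeToString())
print(columns)
```

This needs the raw bytes of a record, so it has to run outside dataset.map, where the record is a symbolic tensor rather than a byte string. One record read up front (for instance with tf.python_io.tf_record_iterator in TF 1.x) should be enough to build the keys_to_features dictionary, assuming all rows share the same columns.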