
I am very new to TensorFlow, so this may be a very basic question. I have seen examples where custom datasets are converted to TFRecord files using knowledge of the features one wants to use (for example, 'image' and 'label'). When parsing such a TFRecord file back, one has to know the features beforehand (i.e. 'image' and 'label') in order to use the dataset.

My question is: how do we parse TFRecord files when we do not know the features beforehand? Suppose someone gives me a TFRecord file and I want to decode all the features stored in it.

Some examples which I am referring to are: Link 1, Link 2

  • How do you intend to use the records if you don't know what data is in them? You might be able to read an example from the records file and list the available fields in there, along with their types, so that you can then write code to parse it properly. Is that what you want? – jdehesa Aug 24 '20 at 15:25
  • Yes please, that would certainly help. As for the intention part, I could think of a scenario where I only know a few features in the TFRecord dataset- say 'location' and 'temperature'. But there are also other features like 'humidity', 'elevation', and other related features present in the dataset encoded in it which I could use in the training process. – Sherine Brahma Aug 24 '20 at 17:49
  • Another scenario would be when a professor from another university, from whom I requested the dataset, mentions in the email that "Image" and "Location" are the features present, but the features which are actually there are "image_var" and "location_var". I would have no way of knowing, because he is probably too busy to reply or is on holiday. – Sherine Brahma Aug 24 '20 at 17:49

1 Answer


Here is something that might help. It's a function that goes through a records file and saves the available information about the features: for each feature name, the kind of data it holds and its size, with `None` in either slot when records disagree. You can modify it to just look at the first record and return that information, although depending on the case it may be useful to see all the records, in case there are optional features only present in some of them or features with variable size.

import tensorflow as tf

def list_record_features(tfrecords_path):
    # Dict of extracted feature information
    features = {}
    # Iterate records
    for rec in tf.data.TFRecordDataset([str(tfrecords_path)]):
        # Get record bytes
        example_bytes = rec.numpy()
        # Parse example protobuf message
        example = tf.train.Example()
        example.ParseFromString(example_bytes)
        # Iterate example features
        for key, value in example.features.feature.items():
            # Kind of data in the feature (None if the feature is unset)
            kind = value.WhichOneof('kind')
            # Size of data in the feature (0 if the feature is unset)
            size = len(getattr(value, kind).value) if kind else 0
            # Check if feature was seen before
            if key in features:
                # Check if values match, use None otherwise
                kind2, size2 = features[key]
                if kind != kind2:
                    kind = None
                if size != size2:
                    size = None
            # Save feature data
            features[key] = (kind, size)
    return features

You could use it like this:

import tensorflow as tf

tfrecords_path = 'data.tfrecord'
# Make some test records
with tf.io.TFRecordWriter(tfrecords_path) as writer:
    for i in range(10):
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    # Fixed length
                    'id': tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[i])),
                    # Variable length
                    'data': tf.train.Feature(
                        float_list=tf.train.FloatList(value=range(i))),
                }))
        writer.write(example.SerializeToString())
# Print extracted feature information
features = list_record_features(tfrecords_path)
print(*features.items(), sep='\n')
# ('id', ('int64_list', 1))
# ('data', ('float_list', None))
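
On the point about optional features: the function above records kind and size, but it does not tell you whether a feature is missing from some records. Here is a minimal sketch of one way to track that, assuming you collect one `{name: (kind, size)}` dict per record inside the loop. The helper name `merge_record_summaries` and the sample dicts are made up for illustration, and the merge itself is plain Python with no TensorFlow calls.

```python
def merge_record_summaries(summaries):
    """Merge per-record {name: (kind, size)} dicts into one overview.

    Returns {name: (kind, size, optional)}. kind or size is None when
    records disagree; optional is True when some record lacks the feature.
    """
    merged = {}   # name -> (kind, size) merged so far
    counts = {}   # name -> number of records containing the feature
    total = 0     # number of records seen
    for summary in summaries:
        total += 1
        for key, (kind, size) in summary.items():
            counts[key] = counts.get(key, 0) + 1
            if key in merged:
                prev_kind, prev_size = merged[key]
                # Same rule as in list_record_features: None on mismatch
                if kind != prev_kind:
                    kind = None
                if size != prev_size:
                    size = None
            merged[key] = (kind, size)
    return {key: (kind, size, counts[key] < total)
            for key, (kind, size) in merged.items()}


# Hypothetical per-record summaries, only for illustration
records = [
    {'id': ('int64_list', 1), 'data': ('float_list', 0)},
    {'id': ('int64_list', 1), 'data': ('float_list', 1),
     'note': ('bytes_list', 1)},
]
overview = merge_record_summaries(records)
print(*overview.items(), sep='\n')
```

A feature flagged as optional will probably need a default value (or a variable-length spec) when you later write the parsing code.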
jdehesa
  • Thank you very much. I checked this and it works. I would say the TensorFlow code is rather complicated for such a simple task. There are so many calls involved that it becomes very difficult for a beginner to understand. – Sherine Brahma Aug 24 '20 at 18:55
  • @SherineBrahma Glad it helps, please consider accepting the answer if you feel it solved your problem. The thing is, you are trying to use TFRecords in an "unexpected way", so you have to navigate the protobuf data by hand, following the spec in [`example.proto`](https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/core/example/example.proto) and [`feature.proto`](https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/core/example/feature.proto). You generally don't need to work at such a low level, as TF provides simpler functions for common uses (like `tf.io.parse_single_example`). – jdehesa Aug 25 '20 at 09:05
  • Yes, I accepted the answer. But apparently StackOverflow thinks that I am too inferior to do that. It says that I have less than 15 reputations and therefore my accepted answer will not be made public. – Sherine Brahma Aug 25 '20 at 09:10
  • @SherineBrahma Yes, you cannot upvote until you have 15 rep points (you're almost there :) ) but you can always accept an answer to your question with the green tick. That way other people can know that the question is already solved. – jdehesa Aug 25 '20 at 09:16
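
Following up on the `tf.io.parse_single_example` remark in the comments: once you have a `{name: (kind, size)}` listing like the one printed in the answer, you can turn it into a parsing spec. This is only a sketch under some assumptions. The helper `make_feature_spec` and the dict `KIND_TO_DTYPE` are names invented here, features with a fixed size are assumed to be present in every record, and features whose kind is `None` are simply skipped.

```python
import tensorflow as tf

# Maps the protobuf 'kind' names to the dtypes the parsing ops accept
KIND_TO_DTYPE = {
    'bytes_list': tf.string,
    'float_list': tf.float32,
    'int64_list': tf.int64,
}


def make_feature_spec(features):
    """Build a spec for tf.io.parse_single_example from {name: (kind, size)}."""
    spec = {}
    for key, (kind, size) in features.items():
        if kind is None:
            # No single dtype is known for this feature, so skip it
            continue
        dtype = KIND_TO_DTYPE[kind]
        if size is None:
            # Size differs between records: parse as a sparse tensor
            spec[key] = tf.io.VarLenFeature(dtype)
        else:
            spec[key] = tf.io.FixedLenFeature([size], dtype)
    return spec


# Illustrative input, shaped like the output printed in the answer
features = {'id': ('int64_list', 1), 'data': ('float_list', None)}
spec = make_feature_spec(features)

# Round trip on an in-memory example, so no file is needed
example = tf.train.Example(features=tf.train.Features(feature={
    'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
    'data': tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.0, 1.0, 2.0])),
}))
parsed = tf.io.parse_single_example(example.SerializeToString(), spec)
ids = parsed['id'].numpy().tolist()
data = tf.sparse.to_dense(parsed['data']).numpy().tolist()
print(ids, data)
```

With a real file you would map the same spec over the dataset, e.g. `tf.data.TFRecordDataset(path).map(lambda rec: tf.io.parse_single_example(rec, spec))`. Note that a `FixedLenFeature` without a `default_value` raises an error on records where the feature is missing, so optional features need extra care.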