We have TFRecord files, each containing a single example whose features hold a list of values. We are using tf.data.Dataset in the following manner:

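import tensorflow as tf

n_rows_per_record_file = 100

def parse_tfrecord_to_example(record_bytes):
    # Every file is assumed to hold exactly n_rows_per_record_file values.
    col_map = {
        "my_col": tf.io.FixedLenFeature(
            shape=n_rows_per_record_file, dtype=tf.int64
        )
    }
    return tf.io.parse_single_example(record_bytes, col_map)

ds = (
    tf.data.TFRecordDataset(file_paths)
    .map(parse_tfrecord_to_example)
)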

Instead of using a fixed constant for n_rows_per_record_file, we would like to look up the number of rows given the file path.

Any ideas on how to achieve this?

We tried something like this:

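def get_shape(filepath):
    # shapes is a Python dict mapping file path -> number of rows
    return filepath, shapes[filepath]

ds = (
    tf.data.Dataset.list_files(file_paths)
    .map(get_shape)
    .map(
        # assumes parse_tfrecord_to_example is extended to accept the shape
        lambda f, shape: tf.data.TFRecordDataset(f).map(
            lambda record: parse_tfrecord_to_example(record, shape)
        )
    )
)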

This fails because tf.data does not evaluate the file path eagerly: inside map, filepath is a symbolic tf.Tensor rather than a Python string, so it cannot be used as a key into the shapes dictionary.

marwan
  • Why don't you just use `tf.io.VarLenFeature`, since the length of your feature varies? See the detailed explanation [here](https://stackoverflow.com/a/47967475/1719231) – PermanentPon Jun 06 '21 at 16:51
  • @PermanentPon thank you for the suggestion. This is the workaround I am currently employing, but I would like to understand how to dynamically pass information given the filename, as I have multiple use-cases which require this behavior (i.e. not just passing the shape information) – marwan Jun 07 '21 at 21:34
  • Are you trying to dynamically merge your TFRecords data with data stored in a Python object, as you do in the `get_shape` function by accessing the `shapes` dictionary? Is that the main problem you are trying to solve? – PermanentPon Jun 07 '21 at 22:03
  • Yes, I suppose you can say so @PermanentPon. For instance, one use-case is that I would like to parse the filename to extract partition column values, and then broadcast and zip them with the rest of the dataset. The problem is that this is hard to implement with native tf operators, and wrapping Python operations in tf.py_function to return nested output is very cumbersome. So I am wondering if there is a commonly used workaround that gives more flexibility without sacrificing much performance. – marwan Jun 08 '21 at 01:20

1 Answer

Your proposed approach looks reasonable. The file path will indeed be a tensor, which is why you can't index an external Python object such as your shapes dictionary with it. With tf.data you unfortunately need to learn a lot of TensorFlow-specific functions to do 'basic' Python things. In your case, for example, you might want to split the file name and then cast a string to an int, all with tensor ops, because inside the pipeline everything is a tensor.
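As a minimal sketch of the "split the file name and cast a string to int" idea: assuming a hypothetical naming scheme in which the row count is embedded in the file name (e.g. rows_100.tfrecord), the parsing can be done with tf.strings ops instead of Python string methods. The paths and the parse_path helper below are made up for illustration and are not part of the original question.

```python
import tensorflow as tf

# Hypothetical file names that encode a partition value and a row count.
file_paths = [
    "data/part=3/rows_100.tfrecord",
    "data/part=7/rows_250.tfrecord",
]

def parse_path(path):
    # path is a scalar tf.string tensor, so use tf.strings ops, not str methods.
    file_name = tf.strings.split(path, "/")[-1]                      # "rows_100.tfrecord"
    stem = tf.strings.regex_replace(file_name, r"\.tfrecord$", "")   # "rows_100"
    n_rows = tf.strings.to_number(
        tf.strings.split(stem, "_")[-1], out_type=tf.int64
    )
    return path, n_rows

ds = tf.data.Dataset.from_tensor_slices(file_paths).map(parse_path)

# Pull the parsed row counts back into Python to inspect them.
row_counts = [int(n) for _, n in ds.as_numpy_iterator()]
print(row_counts)
```

The same pattern should extend to partition column values: split on "/" and "=" and convert with tf.strings.to_number where a numeric type is needed.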

In your comments you also mentioned broadcasting. tf.data is not designed for broadcasting; it is designed for loading data into memory quickly, record by record. So whenever you want to apply vectorisation or broadcasting, you should use something else.

First option: prepare your data before you save it to TFRecords, using whatever tool you want: pandas, Dask, Spark, etc.

Second option: enrich your data on the fly with one of the tf.lookup implementations. For example, if you have a dictionary of shapes and you want to add this feature to every record based on some category or ID, load that data into a StaticHashTable and add a lookup step to your preprocessing. Note: the enrichment data has to be fairly small, because it must fit in memory, possibly even GPU memory if you use GPUs.

So here is an example with a lookup table:

import tensorflow as tf

# A dataset with a single element holding the keys 0..9.
dataset = tf.data.Dataset.from_tensor_slices([range(10)])
keys_tensor = tf.constant(range(10))
vals_tensor = tf.constant(range(100, 110))
lookup = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys_tensor, vals_tensor), default_value=-1)

def map_numbers(v):
    # Replace each key with its value from the table (-1 if the key is missing).
    return lookup.lookup(v)

for element in dataset.map(map_numbers):
    print(element)

Output:

tf.Tensor([100 101 102 103 104 105 106 107 108 109], shape=(10,), dtype=int32)
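Applied to the question's use case, the lookup idea might look roughly like the sketch below. It is only a sketch under stated assumptions: the two tiny TFRecord files, the rows_by_path dictionary, and the helper names are invented for illustration. Since, as far as I know, tf.io.FixedLenFeature needs a static shape, the sketch parses with tf.io.VarLenFeature and uses the looked-up row count in a tf.reshape, which fails at runtime if the count does not match the data.

```python
import os
import tempfile
import tensorflow as tf

# Write two tiny TFRecord files with different lengths (illustrative data only).
tmp_dir = tempfile.mkdtemp()
rows_by_path = {}  # hypothetical stand-in for the question's `shapes` dict
for name, n in [("a.tfrecord", 3), ("b.tfrecord", 5)]:
    path = os.path.join(tmp_dir, name)
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(range(n))))
    example = tf.train.Example(features=tf.train.Features(feature={"my_col": feature}))
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(example.SerializeToString())
    rows_by_path[path] = n

# Move the Python dict into a graph-compatible table keyed by file path.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        tf.constant(list(rows_by_path.keys())),
        tf.constant(list(rows_by_path.values()), dtype=tf.int64),
    ),
    default_value=-1,
)

def records_with_row_count(path):
    # Look up the row count once per file and attach it to every record.
    n_rows = table.lookup(path)
    return tf.data.TFRecordDataset(path).map(lambda rec: (rec, n_rows))

def parse(record_bytes, n_rows):
    parsed = tf.io.parse_single_example(
        record_bytes, {"my_col": tf.io.VarLenFeature(tf.int64)}
    )
    values = tf.sparse.to_dense(parsed["my_col"])
    # Reshape with the looked-up count; errors at runtime on a mismatch.
    return tf.reshape(values, [n_rows]), n_rows

ds = (
    tf.data.Dataset.from_tensor_slices(sorted(rows_by_path))
    .flat_map(records_with_row_count)
    .map(parse)
)

results = [(v.tolist(), int(n)) for v, n in ds.as_numpy_iterator()]
print(results)
```

Using flat_map keeps the result a flat dataset of records rather than a dataset of datasets, and any other per-file metadata could be attached in the same way as n_rows.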
PermanentPon
  • Thank you @PermanentPon for the answer. I was hoping that I could use native Python code, but given that seems to be impossible, I will try to familiarize myself with the tf-specific code. It is only fair you are awarded the bounty for confirming this and providing a sample code snippet. Thanks again – marwan Jun 09 '21 at 14:44
  • Sorry, I couldn't provide the solution you expected, but glad it was helpful. – PermanentPon Jun 10 '21 at 09:14