
Say that I have a TFRecord file, and each row in the tfrecords contains ints that are 0 or positive, padded with -1 so that all the rows are the same size. So something like

0 3 43 223 23 -1 -1 -1
4 12 3 11 435 2 4 -1
9 3 11 32 34 322 9 7
...

How do I randomly select 3 numbers from each of the rows?

The numbers will act as indexes to look up values in an embedding matrix, and those embeddings will then be averaged (basically the word2vec CBOW model).

More specifically, how do I avoid selecting the padding value of -1? The -1 is just what I used to pad my rows so that each row is the same size, in order to use TFRecords. (If there is a way to use varying-length rows in tfrecords, let me know.)
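
Roughly what I'm after, shown in plain NumPy just for illustration (the row is the first example row above):

import numpy as np

row = np.array([0, 3, 43, 223, 23, -1, -1, -1])
valid = row[row != -1]                              # drop the padding values
picked = np.random.choice(valid, 3, replace=False)  # 3 random ids from the row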

SantoshGupta7

1 Answer

I think you're looking for something like tf.VarLenFeature(). More specifically, you do not have to pad your rows prior to creating the tfrecord file. You can create the tf_example like so:

import tensorflow as tf
from tensorflow.train import Feature, Features, Example, Int64List

# One Example holding a single variable-length row of ids.
tf_example = Example(
    features=Features(
        feature={
            "my_feature": Feature(
                int64_list=Int64List(value=[0, 3, 43, 223, 23])
            )
        }
    )
)

# tfrecord_file_path is the output path for your tfrecord file.
with tf.python_io.TFRecordWriter(tfrecord_file_path) as tf_writer:
    tf_writer.write(tf_example.SerializeToString())

Do this for all of your rows, which can vary in length.
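
A minimal sketch of that writing loop, assuming rows is a plain Python list of variable-length int lists:

# Hypothetical data; in practice these come from your own corpus.
rows = [
    [0, 3, 43, 223, 23],
    [4, 12, 3, 11, 435, 2, 4],
    [9, 3, 11, 32, 34, 322, 9, 7],
]

with tf.python_io.TFRecordWriter(tfrecord_file_path) as tf_writer:
    for row in rows:
        tf_example = Example(
            features=Features(
                feature={
                    "my_feature": Feature(int64_list=Int64List(value=row))
                }
            )
        )
        tf_writer.write(tf_example.SerializeToString())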

You'll parse the tf_examples with something like,

def parse_tf_example(example):
    # VarLenFeature handles variable-length features; parsing yields a
    # tf.SparseTensor rather than a dense tensor.
    feature_spec = {
        "my_feature": tf.VarLenFeature(dtype=tf.int64)
    }
    return tf.parse_example([example], features=feature_spec)

Now, this will return your features as tf.SparseTensors. If you don't want to deal with that at this stage and would rather carry on using tensor ops as you normally would, you can simply use tf.sparse_tensor_to_dense() and proceed with ordinary dense tensors.

The returned dense tensors will be of varying lengths, so you shouldn't have to worry about selecting -1s; there won't be any, unless you convert the sparse tensors to dense in batches. In that case each batch will be padded to the length of its longest tensor, and the padding value can be set by the default_value parameter.
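
As a rough sketch of the read path (assuming TF 1.x graph mode and the tfrecord_file_path from above), you could map the parser over a tf.data.TFRecordDataset and densify the result:

dataset = tf.data.TFRecordDataset([tfrecord_file_path])
dataset = dataset.map(parse_tf_example)

iterator = dataset.make_one_shot_iterator()
parsed = iterator.get_next()  # dict with a tf.SparseTensor under "my_feature"

# default_value only matters once you batch; single rows need no padding.
dense_row = tf.sparse_tensor_to_dense(parsed["my_feature"], default_value=-1)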

That covers your question about using varying-length rows in tfrecords and getting back varying-length tensors.

With regards to the lookup op, I haven't used it myself, but I think tf.nn.embedding_lookup_sparse() might help you out here. It offers the ability to look up embeddings directly from the sparse tensor, forgoing the need to convert it to a dense tensor first, and it also has a combiner parameter to specify a reduction op on those embeddings, which in your case would be 'mean'.
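
A minimal sketch of that lookup, assuming hypothetical vocab_size and embedding_dim values and the parsed SparseTensor from the read sketch above:

vocab_size, embedding_dim = 10000, 128  # placeholder sizes for illustration
embedding_matrix = tf.get_variable(
    "embedding_matrix", shape=[vocab_size, embedding_dim])

# sp_weights=None means the ids are unweighted; combiner="mean" averages
# the looked-up embeddings per row, CBOW-style.
averaged = tf.nn.embedding_lookup_sparse(
    embedding_matrix,
    sp_ids=parsed["my_feature"],
    sp_weights=None,
    combiner="mean")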

I hope this helps in some way, good luck.

Sean Bugeja
  • Thanks! Though I'm getting an error when I try to convert the sparse tensor to a dense one. I set `tfrecord_file_path = 'testTFrecord'` and executed the first code block in your answer. Then I executed the second code block to define the function, then I executed `duh = parse_tf_example('testTFrecord')`. Then I executed `rows = tf.sparse_tensor_to_dense(duh)` but then I get a `TypeError: Input must be a SparseTensor.` error – SantoshGupta7 Oct 31 '18 at 09:05
  • If you used the code as I posted it above the sparse tensor will be `duh["my_feature"]`. I know this stuff is a little tricky sometimes, minute differences make annoying differences in the output that are hard to look into. If it still gives you trouble, try firing up an iPython Notebook with `tf.enable_eager_execution()`, that will allow you to print out the contents of your variables, and make it easier to spot where your SparseTensor is. – Sean Bugeja Oct 31 '18 at 09:18
  • Also note, the `parse_tf_example(example)` function I provided is something I would normally map over a [tf.data.TFRecordDataset](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset), to get the actual data, not simply passing the path to the tfrecord file. You can find a better example of what I mean [here](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data), apologies if that wasn't clear in my answer. – Sean Bugeja Oct 31 '18 at 09:25