
I have a feature column that's just a string:

tf.FixedLenFeature((), tf.string)

My graph converts the tensors to binary with tf.decode_raw:

tf.decode_raw(features['text'], tf.uint8)

This works for batch_size = 1, but fails for batch_size > 1 when the strings have different lengths: decode_raw throws `DecodeRaw requires input strings to all be the same size`.

Is there an alternative to tf.decode_raw that returns a padded tensor and the lengths of the strings?

alltom

1 Answer


I'd use a tf.data.Dataset. With eager execution enabled:

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution()

def _decode_and_length_map(encoded_string):
  # Decode each string to a variable-length uint8 tensor and pair it
  # with its length, so the original size survives padding.
  decoded = tf.decode_raw(encoded_string, out_type=tf.uint8)
  return decoded, tf.shape(decoded)[0]

inputs = tf.constant(["aaa", "bbbbbbbb", "abcde"], dtype=tf.string)
dataset = (tf.data.Dataset.from_tensor_slices(inputs)
           .map(_decode_and_length_map)
           # Pad the byte tensors to the longest string in each batch;
           # the scalar lengths (shape []) need no padding.
           .padded_batch(batch_size=2, padded_shapes=([None], [])))
iterator = tfe.Iterator(dataset)
print(iterator.next())
print(iterator.next())

This prints (output manually reformatted for readability):

(<tf.Tensor: id=24, shape=(2, 8), dtype=uint8,
     numpy=array([[97, 97, 97,  0,  0,  0,  0,  0],
                  [98, 98, 98, 98, 98, 98, 98, 98]], dtype=uint8)>,
 <tf.Tensor: id=25, shape=(2,), dtype=int32, numpy=array([3, 8], dtype=int32)>)

(<tf.Tensor: id=28, shape=(1, 5), dtype=uint8, 
     numpy=array([[ 97,  98,  99, 100, 101]], dtype=uint8)>,
 <tf.Tensor: id=29, shape=(1,), dtype=int32, numpy=array([5], dtype=int32)>)

Of course you can mix and match data sources, add randomization, change the padding character, etc.
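For example, here's a sketch reusing `inputs` and `_decode_and_length_map` from above; the shuffle buffer size and the space-byte padding value are arbitrary choices for illustration:

dataset = (tf.data.Dataset.from_tensor_slices(inputs)
           .map(_decode_and_length_map)
           .shuffle(buffer_size=100)  # randomize example order
           .padded_batch(
               batch_size=2,
               padded_shapes=([None], []),
               # Pad with the ASCII space byte (32) instead of zero; the
               # scalar lengths still need a value in the structure, but
               # it's never used since they aren't padded.
               padding_values=(tf.constant(32, dtype=tf.uint8),
                               tf.constant(0, dtype=tf.int32))))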

Also works with graph building:

import tensorflow as tf

def _decode_and_length_map(encoded_string):
  decoded = tf.decode_raw(encoded_string, out_type=tf.uint8)
  return decoded, tf.shape(decoded)[0]

inputs = tf.constant(["aaa", "bbbbbbbb", "abcde"], dtype=tf.string)
dataset = (tf.data.Dataset.from_tensor_slices(inputs)
           .map(_decode_and_length_map)
           .padded_batch(batch_size=2, padded_shapes=([None], [])))
# In graph mode, a one-shot iterator feeds batches into the graph via get_next().
batch_op = dataset.make_one_shot_iterator().get_next()
with tf.Session() as session:
  print(session.run(batch_op))
  print(session.run(batch_op))
Allen Lavoie
  • This looks like it does exactly what I asked for, so have a vote, and thanks! But does eager mode work with tf.estimator.Estimator? I left that part out of the question because I didn't realize it was relevant. – alltom Jan 26 '18 at 06:54
  • Not yet unfortunately; you'll need to stick with the graph building version in a `model_fn`. – Allen Lavoie Jan 26 '18 at 17:16
  • I can use your graph example in a `model_fn`, but replacing the constant `inputs` with a `features['text0']` tensor yields: `ValueError: Cannot capture a stateful node (name:IteratorGetNext, type:IteratorGetNext) by value.` from `make_one_shot_iterator()`. Is there any way around that? – alltom Jan 28 '18 at 02:16
  • 1
    Per https://stackoverflow.com/a/44504063/129889 it looks like I should use `make_initializable_iterator()` instead. Trying… – alltom Jan 28 '18 at 02:21
  • I'm going to accept because this seems like it should work in principle. :) I'm struggling to make it work, but for all I can tell, it's due to problems in the rest of my graph. Thanks for the answer! – alltom Jan 28 '18 at 03:51
  • 1
    Just realized that should have been `input_fn`. The error might go away if you fold this bit into your existing input pipeline (single Dataset) then make an `Iterator` out of that? – Allen Lavoie Jan 29 '18 at 17:41
  • 1
    Oh thank you so much for the correction. That worked! I added one Dataset.map() to expand `features['text']` into `features['text_bytes']` and `features['text_length']` by adapting your `_decode_and_length_map`, then used `padded_batch(batch_size, padded_shapes = ({'text_bytes': (None,), 'text_length': ()}, ()))`. – alltom Jan 30 '18 at 09:37