Working with variable-length text in Tensorflow

Question

I am building a TensorFlow model to perform inference on text phrases. For the sake of simplicity, assume I need a classifier with a fixed number of output classes but variable-length text as input. In other words, my mini batch would be a sequence of phrases, but not all phrases have the same length.

data = ['hello',
        'my name is Mark',
        'What is your name?']

My first preprocessing step was to build a dictionary of all the words that occur in the data and map each word to its integer word ID. The input becomes:

data = [[1],
        [2, 3, 4, 5],
        [6, 4, 7, 3]]

What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data? Or should I pad all strings so that they all have the same length, equal to the length of the longest string, using some placeholder for the missing words? This seems very memory inefficient if some strings are much longer than most of the others.

-- EDIT --

Here is a concrete example.

When I know the size of my datapoints (and all the datapoints have the same length, e.g. 3), I normally use something like:

input = tf.placeholder(tf.int32, shape=(None, 3))

with tf.Session() as sess:
  print(sess.run([...], feed_dict={input:[[1, 2, 3], [1, 2, 3]]}))

where the first dimension of the placeholder is the minibatch size.

What if the input sequences are words in sentences of different length?

feed_dict={input:[[1, 2, 3], [1]]}

Text is often dealt with by a sequence model. IE, your model accepts current word and output of the previous step, and you stack copies of the model. As a baseline, you could start with "bag of words" -- just add all the words together into single dictionary vector. — Yaroslav Bulatov, Jul 27 '16 at 17:42
Thanks for your reply. My question is more about Tensorflow's data structures than about models. I can use an RNN fed with text represented with bag-of-words. Still if my datapoints have different length, where or how do I store this kind of data? — Marco Ancona, Jul 28 '16 at 06:23
I edited the question removing reference to word embeddings and put a more concrete example to clarify my question. — Marco Ancona, Jul 28 '16 at 06:52

score 5 · Answer 1 · answered Jun 19 '17 at 18:03

The other two answers are correct, but low on details. I was just looking at how to do this myself.

There is machinery in TensorFlow to do all of this (for some parts it may be overkill).

Starting from a string tensor (shape [3]):

import tensorflow as tf
lines = tf.constant([
    'Hello',
    'my name is also Mark',
    'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']

The first thing to do is split this into words (note the space before the question mark.)

words = tf.string_split(lines," ")

words will now be a sparse tensor (dense shape [3,7]), where the two columns of the indices are [line number, position]. This is represented as:

indices    values
 0 0       'Hello'
 1 0       'my'
 1 1       'name'
 1 2       'is'
 ...

Now you can do a word lookup:

table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)

This returns a sparse tensor with the words replaced by their vocabulary indices.
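To make the sparse representation concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of roughly what the split and lookup steps produce. The helper name split_and_lookup and its unknown_id parameter are made up for illustration; this only mimics the idea, not the actual ops.

```python
# Plain-Python illustration of the (indices, values, dense_shape) triple
# that a sparse tensor of word ids would hold. Not TensorFlow code.
lines = ['Hello',
         'my name is also Mark',
         'Are there any other Marks here ?']
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there',
              'any', 'other', 'Marks', 'here', '?']

# Map each word to its position in the vocabulary list.
word_to_id = {word: i for i, word in enumerate(vocabulary)}


def split_and_lookup(lines, word_to_id, unknown_id=-1):
    """Hypothetical helper: split on spaces and look up each word's id."""
    indices, values = [], []
    for line_number, line in enumerate(lines):
        for position, word in enumerate(line.split(' ')):
            indices.append((line_number, position))
            # unknown_id is an assumed fallback for out-of-vocabulary words.
            values.append(word_to_id.get(word, unknown_id))
    max_len = max(position for _, position in indices) + 1
    return indices, values, (len(lines), max_len)


indices, values, dense_shape = split_and_lookup(lines, word_to_id)
```

Only the words that actually exist are stored, so a short line next to a long one costs nothing extra until the tensor is made dense.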

Now you can read out the sequence lengths by looking at the maximum position on each line:

line_number = word_indices.indices[:,0]
line_position = word_indices.indices[:,1]
lengths = tf.segment_max(data = line_position, 
                         segment_ids = line_number)+1
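As I understand it, the segment-max trick boils down to "largest position seen on each line, plus one". A small plain-Python sketch of that logic follows; sequence_lengths is a hypothetical helper name, and the indices are the ones for the three example lines.

```python
def sequence_lengths(indices, num_lines):
    """Hypothetical helper: length of each line = max position + 1."""
    lengths = [0] * num_lines
    for line_number, position in indices:
        lengths[line_number] = max(lengths[line_number], position + 1)
    return lengths


# (line number, position) pairs for lines of 1, 5 and 7 words.
indices = [(0, 0),
           (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
           (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)]

lengths = sequence_lengths(indices, num_lines=3)
```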

If you're processing variable-length sequences, you probably want to feed them into an LSTM, so let's use a word embedding for the input (it requires a dense input):

EMBEDDING_DIM = 100

dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)

Now embedded will have a shape of [3,7,100], [lines, words, embedding_dim].
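For intuition, the sparse-to-dense step can be pictured with the plain-Python sketch below. The function name and the default_value parameter are my own for illustration, and the fill value of 0 is an assumption about the default. Note that 0 is also the vocabulary index of 'Hello' here, which is one reason the sequence lengths are needed later.

```python
def sparse_to_dense(indices, values, dense_shape, default_value=0):
    """Hypothetical helper: fill a rows x cols grid, padding empty slots."""
    rows, cols = dense_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


# Two lines: the first has one word, the second has three.
indices = [(0, 0), (1, 0), (1, 1), (1, 2)]
values = [4, 7, 9, 2]
dense = sparse_to_dense(indices, values, dense_shape=(2, 3))
```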

Then a simple lstm can be built:

LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)

And run it across the sequence, handling the padding:

outputs, final_state = tf.nn.dynamic_rnn(
    cell=lstm,
    inputs=embedded,
    sequence_length=lengths,
    dtype=tf.float32)

Now outputs has a shape of [3,7,50], or [line,word,lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations function:

from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))

That does all the index shuffling to select the output from the last timestep. This gives a shape of [3,50], or [line, lstm_size].
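Conceptually, selecting the last activation just means picking the output at position length - 1 on each line. Here is a plain-Python sketch of that idea on nested lists; select_last is a made-up name and this is not the contrib implementation.

```python
def select_last(outputs, lengths):
    """Hypothetical helper: pick outputs[i][lengths[i] - 1] for each line i."""
    return [sequence[length - 1]
            for sequence, length in zip(outputs, lengths)]


# Shape [2 lines, 3 timesteps, 2 features]; steps past the length are padding.
outputs = [
    [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0]],   # real length 1
    [[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],   # real length 3
]
lengths = [1, 3]

final_output = select_last(outputs, lengths)
```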

init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init_t.run()
    init.run()
    print(final_output.eval().shape)

I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.

score 1 · Answer 2 · answered Mar 28 '18 at 08:54

How about this? (I haven't implemented it, but maybe the idea will work.) This method is based on a bag-of-words (BOW) representation.

Get your data as tf.string
Split it using tf.string_split
Find indexes of your words using tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. Length of this tensor can vary.
Find embeddings of your indexes.

    word_embeddings = tf.get_variable("word_embeddings",
                                      [vocabulary_size, embedding_size])
    embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)

Sum up the embeddings, and you will get a tensor of fixed length (= embedding size). You could also choose a reduction other than the sum (the mean, or something else).
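A rough plain-Python sketch of the summing step, just to show why the result has a fixed size regardless of phrase length. The embedding values are toy numbers and bag_of_words is a hypothetical helper, not a TensorFlow function.

```python
def bag_of_words(word_ids, embeddings):
    """Hypothetical helper: sum the embedding rows of the given word ids."""
    size = len(embeddings[0])
    total = [0.0] * size
    for word_id in word_ids:
        for k in range(size):
            total[k] += embeddings[word_id][k]
    return total


# Toy embedding table: 3 words, embedding size 2.
embeddings = [[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]]

short_phrase = bag_of_words([0, 2], embeddings)
long_phrase = bag_of_words([1, 1, 2], embeddings)
```

Both results have length 2 even though the phrases have different lengths. The price is that word order is lost.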

Maybe it’s too late :) Good luck.

score 0 · Answer 3 · answered Jul 28 '16 at 10:50

I was building a sequence-to-sequence translator the other day. What I decided to do was make it for a fixed length of 32 words (which was a bit above the average sentence length), although you can make it as long as you want. I then added a NULL word to the dictionary and padded all my sentence vectors with it. That way I could tell the model where the end of my sequence was, and the model would just output NULL at the end of its output. For instance, take the expression "Hi what is your name?" This would become "Hi what is your name? NULL NULL NULL NULL ... NULL". It worked pretty well, but your results during training will look somewhat better than they really are (higher accuracy, lower loss), since the model usually gets the NULLs right and they count towards the cost.
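A minimal sketch of the fixed-length padding idea in plain Python. NULL_ID = 0 and the helper name are assumptions for illustration, and truncating sentences longer than the fixed length is my own choice here, since that case isn't covered above.

```python
NULL_ID = 0  # assumed id reserved for the NULL padding word


def pad_to_length(sentence, length, pad_id=NULL_ID):
    """Hypothetical helper: pad (or truncate) a list of word ids to `length`."""
    truncated = sentence[:length]  # assumption: overlong sentences are cut
    return truncated + [pad_id] * (length - len(truncated))


padded = pad_to_length([5, 3, 8], length=6)
cut = pad_to_length([1, 2, 3, 4], length=2)
```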

There is another approach called masking. This too allows you to build a model for a fixed-length sequence but only evaluate the cost up to the end of a shorter sequence. You could search for the first instance of NULL in the output sequence (or expected output, whichever is greater) and only evaluate the cost up to that point. Also, I think some TensorFlow functions such as tf.nn.dynamic_rnn support masking (through the sequence_length argument), which may be more memory efficient. I am not sure, since I have only tried the first approach of padding.

Finally, I think the TensorFlow Seq2Seq example uses buckets for different-sized sequences. This would probably solve your memory issue. I think you could share the variables between the different-sized models.
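I haven't checked how the Seq2Seq example implements it, but the general bucketing idea can be sketched in plain Python as below. The bucket sizes, pad id, and helper name are all assumptions, and sentences longer than the largest bucket are simply skipped in this sketch.

```python
def bucket_and_pad(sentences, bucket_sizes, pad_id=0):
    """Hypothetical helper: put each sentence in the smallest bucket that fits."""
    buckets = {size: [] for size in bucket_sizes}
    for sentence in sentences:
        for size in sorted(bucket_sizes):
            if len(sentence) <= size:
                padding = [pad_id] * (size - len(sentence))
                buckets[size].append(sentence + padding)
                break
        # Sentences longer than every bucket fall through and are dropped.
    return buckets


sentences = [[1], [2, 3, 4, 5], [6, 4, 7, 3], [1, 2, 3, 4, 5, 6, 7]]
buckets = bucket_and_pad(sentences, bucket_sizes=[2, 4, 8])
```

Each bucket only pads up to its own size, so one very long sentence no longer forces every short sentence to be padded to that length.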

score 0 · Answer 4 · answered Mar 01 '17 at 05:40

So here is what I did (not sure if it's 100% the right way, to be honest):

In your vocab dict, where each key is a number pointing to one particular word, add another key, say K, which points to "<PAD>" (or any other representation you want to use for padding).

Now your placeholder for input would look something like this:

x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))

where None leaves the second dimension unspecified, so it can match the length of the largest phrase/sentence/record in each mini batch.

Another small trick I used was to store the length of each phrase in my mini batch. For example:

If my input was: x_batch = [[1], [1,2,3], [4,5]] then I store: len_batch = [1, 3, 2]

Later I use this len_batch and the max size of a phrase (l_max) in my mini batch to create a binary mask. Now l_max=3 from above, so my mask would look something like this:

mask = [
[1, 0, 0],
[1, 1, 1],
[1, 1, 0]
]

Now if you multiply this with your loss you would basically eliminate all loss introduced as a result of padding.
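Here is a small plain-Python sketch of building that mask from len_batch and applying it to a per-position loss. The loss numbers are invented, and dividing by the number of real (unmasked) positions is my own assumption about how to normalize.

```python
def build_mask(len_batch, l_max):
    """1 for real positions, 0 for padded positions."""
    return [[1 if t < length else 0 for t in range(l_max)]
            for length in len_batch]


def masked_mean_loss(per_step_loss, mask):
    """Hypothetical helper: average the loss over unmasked positions only."""
    total = sum(loss * m
                for loss_row, mask_row in zip(per_step_loss, mask)
                for loss, m in zip(loss_row, mask_row))
    count = sum(sum(mask_row) for mask_row in mask)
    return total / count


len_batch = [1, 3, 2]
mask = build_mask(len_batch, l_max=3)

# Made-up per-position losses; the 9.0 entries sit on padded positions.
per_step_loss = [[4.0, 9.0, 9.0],
                 [1.0, 1.0, 1.0],
                 [3.0, 2.0, 9.0]]
loss = masked_mean_loss(per_step_loss, mask)
```

The large values on padded positions have no effect on the result, which is the whole point of the mask.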

Hope this helps.
