I am building a TensorFlow model to perform inference on text phrases. For the sake of simplicity, assume I need a classifier with a fixed number of output classes but variable-length text as input. In other words, my mini-batch would be a sequence of phrases, but not all phrases have the same length.
data = ['hello',
'my name is Mark',
'What is your name?']
My first preprocessing step was to build a vocabulary of all the words in the dataset and map each word to an integer word id. The input becomes:
data = [[1],
[2, 3, 4, 5],
[6, 4, 7, 3]]
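For completeness, the mapping is built roughly like this (a minimal sketch; the encode helper is just for illustration, punctuation/case handling is crude, and I reserve 0 as a padding id in case I need it later):

word_to_id = {}

def encode(phrase):
    ids = []
    for word in phrase.split():
        word = word.strip('?!.,').lower()          # crude punctuation/case handling
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1  # 0 is reserved for padding
        ids.append(word_to_id[word])
    return ids

data = [encode(phrase) for phrase in ['hello', 'my name is Mark', 'What is your name?']]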
What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data? Or should I pad all strings so that they all have the same length, equal to the length of the longest string, using some padding value for the missing words? This seems very memory-inefficient if some strings are much longer than most of the others.
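To make the padding option concrete, this is what I have in mind (a sketch, padding with 0 up to the longest sentence in the batch):

max_len = max(len(seq) for seq in data)
padded = [seq + [0] * (max_len - len(seq)) for seq in data]
# padded == [[1, 0, 0, 0], [2, 3, 4, 5], [6, 4, 7, 3]]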
-- EDIT --
Here is a concrete example.
When I know the size of my data points (and all data points have the same length, e.g. 3) I normally use something like:
input = tf.placeholder(tf.int32, shape=(None, 3))
with tf.Session() as sess:
    print(sess.run([...], feed_dict={input: [[1, 2, 3], [1, 2, 3]]}))
where the first dimension of the placeholder is the minibatch size.
What if the input sequences are sentences of different lengths, i.e. a different number of word ids per row?
feed_dict={input: [[1, 2, 3], [1]]}
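The padded alternative I'm considering looks like this (just a sketch: I reserve 0 as a padding id, leave the second dimension of the placeholder unspecified so each batch can be padded to its own longest sentence, and fetch the shape only to check that the rectangular feed is accepted):

import tensorflow as tf

input = tf.placeholder(tf.int32, shape=(None, None))  # batch_size x longest sentence in the batch
batch_shape = tf.shape(input)

with tf.Session() as sess:
    print(sess.run(batch_shape, feed_dict={input: [[1, 2, 3], [1, 0, 0]]}))  # -> [2 3]

Is this the recommended way to handle it, or is there something more memory-efficient?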