
I am building an RNN language model in TensorFlow. My raw input consists of text files. I am able to tokenize them, so the data I am working with is sequences of integers that are indices into a vocabulary.

Following the example in ptb_word_lm.py, I have written code to build a language model that gets its training data via the feed_dict method. However, I do not want to be limited to data sets that can fit in memory, so I would like to use file pipelines to read in data instead. I cannot find any examples of how to do this.

The file-pipeline examples I've seen all pair a tensor of some length n with a label that is a tensor of length 1. (The classic example is a 28 x 28 = 784 item tensor representing an MNIST bitmap, associated with a single integer label ranging from 0 to 9.) RNN training data, however, consists of a vector of n consecutive tokens and a label that is also n consecutive tokens, shifted one position ahead of the vector. For example:

"the quick brown fox jumped"
vectors (n=3): the quick brown, quick brown fox, brown fox jumped
labels (n=3): quick brown fox, brown fox jumped, fox jumped EOF
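To make the shape of the desired examples concrete, here is a minimal sketch in plain Python (no TensorFlow) of the sliding-window pairs I want the pipeline to emit. The `make_examples` helper and the `"EOF"` marker token are my own names for illustration, not part of any TensorFlow API:

```python
def make_examples(tokens, n):
    """Return (vector, label) pairs of n consecutive tokens,
    where each label is shifted one token ahead of its vector."""
    padded = tokens + ["EOF"]  # append an end-of-sequence marker
    return [
        (padded[i:i + n], padded[i + 1:i + 1 + n])
        for i in range(len(padded) - n)
    ]

tokens = "the quick brown fox jumped".split()
for vec, lab in make_examples(tokens, n=3):
    print(vec, lab)
# ['the', 'quick', 'brown'] ['quick', 'brown', 'fox']
# ['quick', 'brown', 'fox'] ['brown', 'fox', 'jumped']
# ['brown', 'fox', 'jumped'] ['fox', 'jumped', 'EOF']
```

The question is how to produce pairs shaped like this from a file-based input pipeline rather than from an in-memory list.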

Can someone give me a code snippet that shows how to write a file pipeline to feed this shape of data into a TensorFlow graph?

W.P. McNeill
