I am building an RNN language model in TensorFlow. My raw input consists of files of text. I am able to tokenize them, so the data I am working with consists of sequences of integers that are indexes into a vocabulary.
Following the example in ptb_word_lm.py, I have written code to build a language model that gets its training data via the feed_dict method. However, I do not want to be limited to data sets that can fit in memory, so I would like to use file pipelines to read in data instead. I cannot find any examples of how to do this.
The file pipeline examples I've seen all associate a tensor of some length n with a label that is a tensor of length 1. (The classic example is a 28 × 28 = 784-element tensor representing an MNIST bitmap, associated with a single integer label ranging from 0 to 9.) However, RNN training data consists of a vector of n consecutive tokens and a label also consisting of n consecutive tokens (shifted one position ahead of the vector), for example:
"the quick brown fox jumped"
vectors (n=3): the quick brown, quick brown fox, brown fox jumped
labels (n=3): quick brown fox, brown fox jumped, fox jumped EOF
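To make the shape concrete, here is a minimal, framework-agnostic sketch of the windowing I mean (token strings stand in for vocabulary indexes, and the function name is my own, not from any TensorFlow API):

```python
def sliding_windows(tokens, n):
    """Yield (input, label) pairs of n consecutive tokens,
    with the label shifted one position ahead of the input."""
    for i in range(len(tokens) - n):
        yield tokens[i:i + n], tokens[i + 1:i + n + 1]

# The "quick brown fox" example above, with EOF as the final token:
tokens = ["the", "quick", "brown", "fox", "jumped", "EOF"]
for vec, lab in sliding_windows(tokens, 3):
    print(vec, lab)
```

What I am looking for is the equivalent of this, but expressed as a TensorFlow input pipeline reading from files rather than an in-memory list.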
Can someone give me a code snippet that shows how to write a file pipeline to feed this shape of data into a TensorFlow graph?