Ngrams from Tensorflow TextLineDataset

Question

I have a text file containing one sentence per line.

When I create a TextLineDataset and iterate over it with an iterator, it returns the file line by line.

I want to iterate through my file two tokens at a time. Here's my current code:

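import tensorflow as tf  # TensorFlow 1.x API (tf.Session, initializable iterators)

# Each element of the dataset is one line of the file.
sentences = tf.data.TextLineDataset("data/train.src")
iterator = sentences.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()

sess.run(tf.tables_initializer())
sess.run(iterator.initializer)

# Fetch and print the first line.
elem = sess.run(next_element)
print(elem)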

Is it possible to do so using a TextLineDataset?

EDIT: By "tokens" I mean "words".

Just to clarify - when you say tokens are you referring to the elements of a line? Or rather to the line itself? — rvinas, Jul 30 '19 at 21:28
I refer to the elements of a line, we can say "words" instead of "tokens" if you prefer. I'll edit my question, thanks — Valentin Macé, Jul 31 '19 at 08:12

score 3 · Accepted Answer · answered Aug 02 '19 at 06:55

Absolutely, this is possible, but you have a little bit of wrangling to do. You need to:

1. split each line into words
2. flatten this to a single stream of words
3. batch into 2s

We can use tf.strings.split for 1.:

words = sentences.map(tf.strings.split)

and flat_map for 2.:

flat_words = words.flat_map(tf.data.Dataset.from_tensor_slices)

and batch for 3.:

word_pairs = flat_words.batch(2)

and, of course, we could chain all these operations together to give us something like this:

word_pairs = sentences \
  .map(tf.strings.split) \
  .flat_map(tf.data.Dataset.from_tensor_slices) \
  .batch(2)
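
To see what this pipeline yields without running TensorFlow, here is a rough plain-Python sketch of the same three steps. It is only an illustration, not the tf.data implementation: the helper name word_pairs and the sample_lines data are made up for this example, and str.split() stands in for tf.strings.split with its default whitespace separator.

```python
from itertools import chain, islice


def word_pairs(lines, size=2):
    """Yield consecutive, non-overlapping groups of `size` words (hypothetical helper)."""
    # Steps 1 and 2: split every line on whitespace and flatten into one word stream.
    words = chain.from_iterable(line.split() for line in lines)
    # Step 3: cut the stream into chunks of `size`; the final chunk may be shorter,
    # which mirrors batch(2) with its default drop_remainder=False.
    while True:
        chunk = list(islice(words, size))
        if not chunk:
            return
        yield chunk


# Made-up sample data standing in for the lines of data/train.src.
sample_lines = ["the cat sat", "on the mat", "today"]
pairs = list(word_pairs(sample_lines))
for pair in pairs:
    print(pair)
```

Two things are worth noticing. Because the words are flattened before batching, a pair can span two sentences (here "sat" and "on"), and the pairs do not overlap. With an odd number of words, the last batch holds a single word unless you pass drop_remainder=True to batch.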
