I have a text file containing one sentence per line.

When I create a TextLineDataset and iterate over it, the iterator returns the file line by line.

I want to iterate through my file two tokens at a time. Here's my current code:

import tensorflow as tf

sentences = tf.data.TextLineDataset("data/train.src")
iterator = sentences.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()

sess.run(tf.tables_initializer())
sess.run(iterator.initializer)

elem = sess.run(next_element)
print(elem)

Is it possible to do so using a TextLineDataset?

EDIT: By "tokens" I mean "words".

Valentin Macé
  • Just to clarify: when you say tokens, are you referring to the elements of a line, or to the line itself? – rvinas Jul 30 '19 at 21:28
  • I mean the elements of a line; we can say "words" instead of "tokens" if you prefer. I'll edit my question, thanks – Valentin Macé Jul 31 '19 at 08:12

1 Answer

This is absolutely possible, but it takes a little wrangling. You need to:

  1. split each line into words
  2. flatten the result into a single stream of words
  3. batch the words into groups of 2

We can use tf.strings.split for step 1:

words = sentences.map(tf.strings.split)

and flat_map for step 2:

flat_words = words.flat_map(tf.data.Dataset.from_tensor_slices)

and batch for step 3:

word_pairs = flat_words.batch(2)

And, of course, we can chain all these operations together to get something like this:

word_pairs = sentences \
  .map(tf.strings.split) \
  .flat_map(tf.data.Dataset.from_tensor_slices) \
  .batch(2)
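
To see what this pipeline yields, here is a minimal sketch that assumes TensorFlow 2.x with eager execution (the question's code uses the 1.x session API). The two in-memory sentences are a hypothetical stand-in for data/train.src, and the variable pairs is only there for printing:

```python
import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled

# Hypothetical in-memory stand-in for tf.data.TextLineDataset("data/train.src")
sentences = tf.data.Dataset.from_tensor_slices(["the cat sat", "on mat"])

word_pairs = (
    sentences
    .map(tf.strings.split)                         # each line -> 1-D tensor of words
    .flat_map(tf.data.Dataset.from_tensor_slices)  # one flat stream of scalar words
    .batch(2)                                      # group consecutive words in twos
)

# Decode the byte strings so the result is easy to read
pairs = [[w.decode("utf-8") for w in batch.numpy()] for batch in word_pairs]
print(pairs)  # [['the', 'cat'], ['sat', 'on'], ['mat']]
```

Note that because the words are flattened into one stream before batching, a pair can span two lines ("sat", "on"), and the last batch holds a single word when the total count is odd. Passing drop_remainder=True to batch should discard that short final batch if it is unwanted.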
Stewart_R