18

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage: The documented ways of getting data don't really fit how I usually get my data. In my case, I have a thread that receives the data, and I don't know in advance when the stream will end, but I can see when it ends. Then I wait until all buffers have been processed, and at that point I have finished one epoch. How can I express this logic with the Dataset?

Note that I prefer the Dataset interface over the QueueBase interface because it gives me an iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful than queues, which currently cannot be reopened after they are closed (see here and here).

Maybe a similar question, or the same question: How can I wrap a Dataset around a queue? I have some thread which reads data from somewhere and can feed and enqueue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.

Albert
  • Did you find a convenient way to do this? I'm facing the same issue actually and am starting to consider using the `Dataset` API instead of raw data loading (I find the packaging of `Dataset` much more elegant). – Vince.Bdn Jun 22 '17 at 08:44
  • @Vince.Bdn: No, I did not get any response and I think that there is currently no way to do that, unless the TF devs add such functionality. An ongoing discussion about missing functionality in `Dataset` is [here](https://github.com/tensorflow/tensorflow/issues/7951), so maybe comment there and refer to me (@albertz on GitHub) and this StackOverflow question. – Albert Jun 22 '17 at 11:46

1 Answer

8

The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)

The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:

def receiver():
  while True:
    next_element = ...  # Receive next element from external source.
                        # Note that this method may block.

    end_of_epoch = ...  # Decide whether or not to stop based on next_element.

    if not end_of_epoch:
      yield next_element  # Note: you may need to convert this to an array.
    else:
      return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)

# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.

dataset = dataset.repeat(...)  # Note that each repetition will call
                               # `receiver()` again, and start from
                               # a fresh state.

dataset = dataset.batch(...)
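
For concreteness, here is a minimal end-to-end version of the above sketch. The fake_source generator, the element count, and the batch size are illustrative stand-ins, not part of the original question; the loop consumes batches until the generator returns:

import tensorflow as tf

def fake_source():
  # Stands in for the receiving thread; in the real setting each
  # `yield` would block until the next buffer arrives.
  for i in range(10):
    yield i

dataset = tf.contrib.data.Dataset.from_generator(
    fake_source, output_types=tf.int64)
dataset = dataset.batch(3)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_batch))  # [0 1 2], [3 4 5], [6 7 8], [9]
    except tf.errors.OutOfRangeError:
      break  # The generator returned, i.e. one "epoch" has ended.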

More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
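
A sketch of that idea, assuming each call to receiver() opens an independent connection to the external source (the cycle_length and block_length values are illustrative):

def make_receiver_dataset(_):
  # The index tensor is ignored; each interleave branch gets its
  # own independent invocation of `receiver()`.
  return tf.contrib.data.Dataset.from_generator(
      receiver, output_types=...)

dataset = tf.contrib.data.Dataset.range(4).interleave(
    make_receiver_dataset,
    cycle_length=4,   # Four receivers open concurrently.
    block_length=1)   # Take one element from each in turn.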

mrry
  • Is it possible to feed repetitions of the dataset through the graph, with an `OutOfRangeError` thrown between each repetition? I wrote [this example](https://gist.github.com/samwhitlock/93c955b26a329cf2e34c932abff86199) which emits 10 elements per iteration with a batch size of 3, so the "epochs" get mixed (first integer in each of the output tuples per batch). What I'd really like to do is emit a smaller batch at the end of each epoch and then move on to the next one (without restarting the graph, maybe?). Is this possible in TensorFlow? – Sam Aug 29 '17 at 09:36
  • 1
    @SamWhitlock I'm not sure if this fully addresses your problem, but - since you're using an initializable iterator - one possibility is to replace the `Dataset.repeat(3)` with a Python `for epoch in range(3):` loop, inside which you can catch the `OutOfRangeError` and re-run the initializer (sketched after these comments). If that doesn't work, feel free to create a new question so we can go into it in detail! – mrry Aug 29 '17 at 14:48
  • I tried rerunning the initializer, but I can't seem to make it work. I created a new question for it here https://stackoverflow.com/questions/45956139/resetting-a-tensorflow-graph-after-outofrangeerror-when-using-dataset – Sam Aug 30 '17 at 09:01
  • Really, a generator as a stub is a very convenient tool for code testing -- thanks for the idea – JeeyCi May 06 '22 at 18:21
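
A minimal sketch of the epoch-loop pattern mrry describes in the comments above, reusing the illustrative fake_source generator from the earlier example (the epoch count is arbitrary). Re-running the initializer restarts the generator, so batches from different epochs never mix and the final, smaller batch of each epoch is emitted before the OutOfRangeError:

dataset = tf.contrib.data.Dataset.from_generator(
    fake_source, output_types=tf.int64).batch(3)
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  for epoch in range(3):            # Instead of Dataset.repeat(3).
    sess.run(iterator.initializer)  # Restarts the generator.
    while True:
      try:
        print(epoch, sess.run(next_batch))
      except tf.errors.OutOfRangeError:
        break  # End of this epoch; re-initialize for the next.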