
I am trying to make use of queues for loading data from files in TensorFlow.

I would like to run the graph with validation data at the end of each epoch to get a better feel for how the training is going.

That is where I am running into problems. I can't seem to figure out how to switch between training data and validation data when using queues.

I have stripped down my code to a bare minimum toy example to make it easier to get help. Instead of including all the code that loads the image files, performs inference, and training, I have chopped it off at the point where the filenames are loaded into the queue.

import tensorflow as tf

#  DATA
train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]

# SETTINGS
batch_size = 3
batches_per_epoch = 2
epochs = 2

# CREATE GRAPH
graph = tf.Graph()
with graph.as_default():
    file_list = tf.placeholder(dtype=tf.string, shape=None)
    
    # Create a queue consisting of the strings in `file_list`
    q = tf.train.string_input_producer(train_items, shuffle=False, num_epochs=None)
    
    # Create batch of items.
    x = q.dequeue_many(batch_size)
    
    # Inference, train op, and accuracy calculation after this point
    # ...


# RUN SESSION
with tf.Session(graph=graph) as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    
    # Start populating the queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
    try:
        for epoch in range(epochs):
            print("-"*60)
            for step in range(batches_per_epoch):
                if coord.should_stop():
                    break
                train_batch = sess.run(x, feed_dict={file_list: train_items})
                print("TRAIN_BATCH: {}".format(train_batch))
    
            valid_batch = sess.run(x, feed_dict={file_list: valid_items})
            print("\nVALID_BATCH : {} \n".format(valid_batch))
    
    except Exception as e:
        coord.request_stop(e)
    finally:
        coord.request_stop()
        coord.join(threads)

Variations and experiments

Trying different values for num_epochs

num_epochs=None

If I set the num_epochs argument in tf.train.string_input_producer() to None, it gives me the following output, which shows that it runs two epochs as intended, but it uses data from the training set when running evaluation.

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

VALID_BATCH : ['train_file_0' 'train_file_1' 'train_file_2']

------------------------------------------------------------
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']

VALID_BATCH : ['train_file_3' 'train_file_4' 'train_file_5']

num_epochs=2

If I set the num_epochs argument in tf.train.string_input_producer() to 2, it gives me the following output, which shows that it does not even get through the two full epochs (and evaluation is still using training data).

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

VALID_BATCH : ['train_file_0' 'train_file_1' 'train_file_2']

------------------------------------------------------------
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

num_epochs=1

If I set the num_epochs argument in tf.train.string_input_producer() to 1, in the hope that it will flush out any additional training data from the queue so it can make use of the validation data, I get the following output, which shows that it terminates as soon as it gets through one epoch of training data and never gets to load the evaluation data.

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

Setting capacity argument to various values

I have also tried setting the capacity argument in tf.train.string_input_producer() to small values, such as 3 and 1, but this had no effect on the results.

What other approach should I take?

What other approach could I take to switch between training and validation data? Would I have to create separate queues? I am at a loss as to how to get that to work. Would I have to create additional coordinators and queue runners as well?

ronrest
  • Isn't your queue always being created with train_items? "q = tf.train.string_input_producer(train_items, shuffle=False, num_epochs=None)" – amin__ Aug 15 '18 at 17:15

4 Answers


I am compiling a list of potential approaches that might solve this issue here. Most of these are just vague suggestions, with no actual code examples to show how to make use of them.

Placeholder with default

Suggested here
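
A rough sketch of how this might look with the toy data from the question (this is not code from the linked suggestion, just an untested illustration of the idea): the queue keeps feeding training batches by default, and the placeholder is only fed explicitly when running validation.

import tensorflow as tf

train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]
batch_size = 3

# Training pipeline backed by a queue, as in the question.
train_q = tf.train.string_input_producer(train_items, shuffle=False)
train_batch = train_q.dequeue_many(batch_size)

# If nothing is fed, `data` falls back to the training batch coming off
# the queue; feeding `data` directly overrides it with validation items.
data = tf.placeholder_with_default(train_batch, shape=[None])

# ... build inference / train op / accuracy on top of `data` ...

# Training step: the queue supplies the batch.
#     sess.run(train_op)
# Validation step: feed the placeholder to bypass the queue.
#     sess.run(accuracy, feed_dict={data: valid_items})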

Using tf.cond()

Suggested here

Also suggested by sygi in this very Stack Overflow thread. link

Using tf.group() and tf.cond()

Suggested here

make_template() method

Suggested here and here
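
A rough, untested sketch of the idea (here `train_batch` and `valid_batch` are assumed to be float tensors of shape [None, 10] coming from two separate input pipelines):

import tensorflow as tf

def model_fn(x):
    # Variables created inside the template are created on the first
    # call and reused on every subsequent call.
    w = tf.get_variable("w", shape=[10, 1])
    return tf.matmul(x, w)

model = tf.make_template("model", model_fn)

# Both towers share the same weights, so each one can be driven by
# its own queue-based input pipeline.
train_logits = model(train_batch)
valid_logits = model(valid_batch)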

Shared weights method

Suggested by sygi in this very Stack Overflow thread (link). This might be the same as the make_template() method.

QueueBase() method

Suggested here, with sample code here. Code adapted to my problem here on this thread. link

Training bucket method

Suggested here

ronrest
  • op, have you been able to find the best solution? I've been stuck on this for the past couple of days. – Shaayaan Sayed Jan 18 '17 at 22:46
  • Here is another approach which uses dequeue inside the tf.cond statement: https://groups.google.com/a/tensorflow.org/d/msg/discuss/mLrt5qc9_uU/gU8HRYOuCwAJ Not sure it actually works. – Lenar Hoyt Jun 15 '17 at 12:40

First, you can manually read the examples in your code (into numpy arrays) and pass them in any way you want:

data = tf.placeholder(tf.float32, [None, DATA_SHAPE])
for _ in xrange(num_epochs):
  some_training = read_some_data()
  sess.run(train_op, feed_dict={data: some_training})
  some_testing = read_some_test_data()
  sess.run(eval_op, feed_dict={data: some_testing})

If you need to use Queues, you can try to conditionally change the queue from "training" to "testing" one:

train_filenames = tf.train.string_input_producer(["training_file"])
train_q = some_reader(train_filenames)
test_filenames = tf.train.string_input_producer(["testing_file"])
test_q = some_reader(test_filenames)

am_testing = tf.placeholder(dtype=bool, shape=())
data = tf.cond(am_testing, lambda: test_q, lambda: train_q)
train_op, accuracy = model(data)

for _ in xrange(num_epochs):
  sess.run(train_op, feed_dict={am_testing: False})
  sess.run(accuracy, feed_dict={am_testing: True})

The second approach is considered unsafe though: tf.cond() only selects which result is returned, and ops created outside its branch functions (such as the reading ops for both queues here) may still run on every step. In this post it is encouraged to build two separate graphs for training and testing (with shared weights), which is yet another way to achieve what you want.
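
For reference, a rough sketch of one common way to share weights between a training and a testing sub-graph, using variable scope reuse (an untested illustration; `train_data` and `test_data` are assumed to be float tensors of shape [None, 10] coming from two separate input pipelines):

import tensorflow as tf

def build_tower(x):
    # get_variable looks variables up by name, so both towers below
    # end up using the same weights.
    w = tf.get_variable("w", shape=[10, 1])
    return tf.matmul(x, w)

with tf.variable_scope("model"):
    train_out = build_tower(train_data)       # creates the variables
with tf.variable_scope("model", reuse=True):
    test_out = build_tower(test_data)         # reuses the same variables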

sygi
  • Thanks sygi, yes, I prefer to move away from placeholders for the current project. I am dealing with image files that come in all kinds of shapes and sizes, so I cannot easily import those into a numpy array; I have to do data preprocessing and resizing on them. Prefetching the images using queues and using TensorFlow's image preprocessing functions makes this more manageable. – ronrest Dec 16 '16 at 11:35
  • For some reason, I haven't been able to get the `tf.cond()` method to work for me, though I am sure it is a silly mistake in my code. I will definitely look into the method using shared weights, though changing the rest of my code to work with shared weights properly might open a whole new can of worms I am not ready to deal with at the moment. For now I have got a solution working with `QueueBase.from_list()`, though I suspect that your suggestion of using shared weights might be a much better solution. – ronrest Dec 16 '16 at 11:40

OK, so I have a solution that is working for me. It is based on code taken from this post in the TensorFlow GitHub issues section. It makes use of the QueueBase.from_list() function. It feels very hacky, and I am not entirely happy with it, but at least I am getting it to work.

import tensorflow as tf

# DATA
train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]

# SETTINGS
batch_size = 3
batches_per_epoch = 2
epochs = 2

# ------------------------------------------------
#                                            GRAPH
# ------------------------------------------------
graph = tf.Graph()
with graph.as_default():
    # TRAIN QUEUE
    train_q = tf.train.string_input_producer(train_items, shuffle=False)

    # VALID/TEST QUEUE
    test_q = tf.train.string_input_producer(valid_items, shuffle=False)

    # SELECT QUEUE
    is_training = tf.placeholder(tf.bool, shape=None, name="is_training")
    q_selector = tf.cond(is_training,
                         lambda: tf.constant(0),
                         lambda: tf.constant(1))

    # select_q = tf.placeholder(tf.int32, [])
    q = tf.QueueBase.from_list(q_selector, [train_q, test_q])

    # Create batch of items.
    data = q.dequeue_many(batch_size)


# ------------------------------------------------
#                                          SESSION
# ------------------------------------------------
with tf.Session(graph=graph) as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # Start populating the queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)


    try:
        for epoch in range(epochs):
            print("-" * 60)
            # TRAIN
            for step in range(batches_per_epoch):
                if coord.should_stop():
                    break
                print("TRAIN.dequeue = " + str(sess.run(data, {is_training: True})))

            # VALIDATION
            print "\nVALID.dequeue = " + str(sess.run(data, {is_training: False}))

    except Exception as e:
        coord.request_stop(e)

    finally:
        coord.request_stop()
        coord.join(threads)

This gives the following output, which is what I expected.

------------------------------------------------------------
TRAIN.dequeue = ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN.dequeue = ['train_file_3' 'train_file_4' 'train_file_5']

VALID.dequeue = ['valid_file_0' 'valid_file_1' 'valid_file_2']
------------------------------------------------------------
TRAIN.dequeue = ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN.dequeue = ['train_file_3' 'train_file_4' 'train_file_5']

VALID.dequeue = ['valid_file_0' 'valid_file_1' 'valid_file_2']

I am leaving this thread open in the hopes that a better solution comes along.

ronrest
  • do you by any chance know a better way to handle this? It's been a year since you posted this and I still can't find a decent way to do this... – MoneyBall Feb 24 '18 at 02:14

Creating two different queues is discouraged.

If you have two different machines, I would recommend using separate machines for training and validation (if not, you can use two different processes). For the two-machine case:

  1. The first machine has only training data. It uses queues to pass the data in batches to the graph, and it has a GPU for training. After each step it saves the new model (model_iteration) somewhere the second machine can access it.
  2. The second machine (which has only validation data) periodically polls that location and checks whether a new model is available. If so, it runs inference with the new model and checks its performance. Because the validation data is usually significantly smaller than the training data, you can even afford to keep it all in memory. (A rough sketch of this polling loop is shown below.)

A few pros of this approach: the training and validation data are kept separate, so you cannot mix them up, and you can use a weaker machine for validation, because even if validation lags behind training (an unlikely scenario) it is not a problem, since the two are independent.
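
A rough sketch of what the polling loop on the second (validation) machine might look like. The checkpoint directory and run_validation() are hypothetical placeholders, and it is assumed that the evaluation graph and its variables have already been built:

import time
import tensorflow as tf

CHECKPOINT_DIR = "/shared/checkpoints"  # hypothetical shared location

# Saver for restoring the variables of the already-built evaluation graph.
saver = tf.train.Saver()
last_seen = None

with tf.Session() as sess:
    while True:
        ckpt = tf.train.latest_checkpoint(CHECKPOINT_DIR)
        if ckpt is not None and ckpt != last_seen:
            saver.restore(sess, ckpt)
            # Hypothetical helper: feeds the in-memory validation set
            # through the accuracy op and logs the result.
            run_validation(sess)
            last_seen = ckpt
        time.sleep(60)  # poll for a new checkpoint once a minute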

Salvador Dali
  • Is there a better way to run CV? I don't like to run it on a separate machine/session.... Why such a simple thing is this much complicated?!! – Nejla Dec 04 '17 at 04:09