I made a custom estimator (see this colab) in TensorFlow (v1.10) based on their guide.

I trained the toy model with:

tf.estimator.train_and_evaluate(est, train_spec, eval_spec)

and then, with some test-set data, I try to evaluate the model with:

test_fn = lambda: input_fn(DATASET['test'], run_params)
test_res = est.evaluate(input_fn=test_fn)

(where train_fn and valid_fn are functionally identical to test_fn, i.e. sufficient for tf.estimator.train_and_evaluate to work).

I would expect the evaluation to run and return metrics; instead, this is all I get:

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-09-13:38:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./test/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

and then it just runs forever.

How come?

SumNeuron

1 Answer

This is because you repeat the dataset indefinitely:

# In input_fn: repeat() with no count makes the dataset infinite
dataset = dataset.repeat().batch(batch_size)

By default, estimator.evaluate() runs until the input_fn raises an end-of-input exception. Because you repeat the test dataset indefinitely, it never raises the exception and keeps running.

You can either remove the `repeat()` when building the test dataset, or cap the evaluation at a fixed number of batches with the `steps` argument of `evaluate()`, just as it is used in your original `eval_spec`.
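To see why both fixes work, here is a minimal pure-Python analogy of the pipeline (the `batches` and `evaluate` helpers are illustrative stand-ins, not TensorFlow code): an infinitely repeated stream never signals end-of-input, so the consuming loop must either get a finite stream or an explicit step cap.

```python
import itertools

def batches(data, batch_size, repeat=False):
    # Stand-in for the input_fn: with repeat=True the stream never ends,
    # just like tf.data.Dataset.repeat() with no count.
    stream = itertools.cycle(data) if repeat else iter(data)
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return  # "end-of-input": this is the signal evaluate() waits for
        yield batch

def evaluate(batch_iter, steps=None):
    """Consume batches until end-of-input, or for at most `steps` batches."""
    if steps is not None:
        batch_iter = itertools.islice(batch_iter, steps)
    return sum(1 for _ in batch_iter)  # number of batches processed

# Option 1: no repeat -> the loop terminates on its own.
print(evaluate(batches(range(10), 4)))                        # 3 (4 + 4 + 2)

# Option 2: keep repeat, but bound the run, like evaluate(steps=...).
print(evaluate(batches(range(10), 4, repeat=True), steps=5))  # 5

# With repeat=True and no steps, evaluate() would never return,
# which is exactly the hang in the question.
```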

Olivier Dehaene
  • Removing the repeat works and I think I understand why steps is required when using `repeat`. I have updated the colab and tried your other suggestion, removing repeat, but this raises an error. Could you please assist? – SumNeuron Nov 11 '18 at 09:45
  • It's because of this: `labels.set_shape(O_SHAPE(params['batch_size']))`. Just remove it. labels can have a different size, since the number of inputs might not be divisible by batch_size, meaning that the last batch might have less data. – Olivier Dehaene Nov 11 '18 at 14:16
  • Removing `labels.set_shape(O_SHAPE(params['batch_size']))` doesn't fix the issue and raises an error... >`InvalidArgumentError (see above for traceback): Inputs to operation loss_fn/logistic_loss/Select of type Select must have the same size and shape. Input 0: [10,20,4] != input 1: [1,20,4]` ... – SumNeuron Nov 11 '18 at 19:31
  • change inet.set_shape(I_SHAPE(params['batch_size'])) to inet.set_shape(I_SHAPE(None)) for the same reason. – Olivier Dehaene Nov 12 '18 at 09:29
  • Tried that as well prior to posting my comment; this causes a similar error. One has to set the shape because tf records are the worst. – SumNeuron Nov 12 '18 at 09:53
  • It works on my side using your colab. Are you sure you changed the one inside your build_fn ? – Olivier Dehaene Nov 12 '18 at 10:23
  • Foremost I want to state my gratitude for your extensive assistance. I updated the `run_params` to have `"batch_size": None`, but this raises `ValueError: None values not supported.` before the model can even get off the ground. – SumNeuron Nov 12 '18 at 13:00
  • any ideas on how to get around this? – SumNeuron Nov 14 '18 at 08:40
  • could you maybe look at this: https://stackoverflow.com/questions/53307954/tensorflow-custom-estimator-predict-throwing-value-error – SumNeuron Nov 15 '18 at 09:59