
I'm trying to use a validation monitor in skflow by passing my validation set as numpy array.

Here is some simple code to reproduce the problem (I installed tensorflow from the provided binaries for Ubuntu/Linux 64-bit, GPU enabled, Python 2.7):

import numpy as np
from sklearn.cross_validation import train_test_split
from tensorflow.contrib import learn
import tensorflow as tf
import logging
logging.getLogger().setLevel(logging.INFO)

# Some fake data: a noisy sine wave
N = 200
X = np.array(range(N), dtype=np.float32) / (N / 10)
X = X[:, np.newaxis]
Y = np.sin(X.squeeze()) + np.random.normal(0, 0.5, N)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    train_size=0.8,
                                                    test_size=0.2)

val_monitor = learn.monitors.ValidationMonitor(X_test, Y_test,
                                               early_stopping_rounds=200)
reg = learn.DNNRegressor(hidden_units=[10, 10],
                         activation_fn=tf.tanh,
                         model_dir="tmp/")
reg.fit(X_train, Y_train, steps=5000, monitors=[val_monitor])
print "train error:", reg.evaluate(X_train, Y_train)
print "test error:", reg.evaluate(X_test, Y_test)

The code runs, but only the first validation step works properly; after that, validation always returns the same value, even though training is actually progressing fine, which can be verified by evaluating on the test set at the end. The following message also appears for each validation step:

INFO:tensorflow:Input iterator is exhausted. 

Any help is welcome! Thanks, David

dbikard
  • What does your data look like? You need to pay more attention to `every_n_steps`, `steps`, and `batch_size` when you are using it. – Yuan Tang Jul 21 '16 at 17:09
  • I've now edited my question to provide an example. I cannot call ValidationMonitor with `batch_size` and `steps` as keyword arguments. I installed tensorflow from the provided binaries for Ubuntu/Linux 64-bit, GPU enabled, Python 2.7. Maybe the code for the monitors was changed recently? – dbikard Jul 25 '16 at 11:59
  • Yeah it's changed quite a bit. Please try latest version. – Yuan Tang Jul 25 '16 at 16:52
  • I've now built tensorflow from source but I still run into the same problem. I now get this message in the logs: `INFO:tensorflow:Skipping evaluation due to same checkpoint tmp/model.ckpt-0-?????-of-00001 for step 200 as for step 100.` – dbikard Jul 26 '16 at 11:07

2 Answers


I was able to solve this by adding config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1) to the DNNRegressor call. The monitor evaluates the latest saved checkpoint, so if no new checkpoint has been written since the last evaluation, it keeps reporting the same value (hence the "Skipping evaluation due to same checkpoint" message in the logs); saving checkpoints more frequently fixes this.
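For reference, a minimal sketch of the adjusted call, reusing the hyperparameters from the question:

reg = learn.DNNRegressor(hidden_units=[10, 10],
                         activation_fn=tf.tanh,
                         model_dir="tmp/",
                         # Write a checkpoint every second so the
                         # ValidationMonitor always finds a fresh one.
                         config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))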

dbikard

Improving on dbikard's solution:

Add config=tf.contrib.learn.RunConfig(save_checkpoints_steps=val_monitor._every_n_steps) to the DNNRegressor call instead.

This saves checkpoints exactly when they are needed (i.e. each time just before the monitor is triggered) rather than once per second.
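For example, a sketch of the same idea that avoids reaching into the private `_every_n_steps` attribute by passing the interval explicitly (100 is the monitor's default `every_n_steps`):

every_n = 100  # how often (in steps) the monitor should evaluate
val_monitor = learn.monitors.ValidationMonitor(X_test, Y_test,
                                               every_n_steps=every_n,
                                               early_stopping_rounds=200)
reg = learn.DNNRegressor(hidden_units=[10, 10],
                         activation_fn=tf.tanh,
                         model_dir="tmp/",
                         # Checkpoint exactly as often as the monitor runs,
                         # so each evaluation sees a fresh checkpoint.
                         config=tf.contrib.learn.RunConfig(
                             save_checkpoints_steps=every_n))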

tinu