
I'm having trouble getting the loss summaries (training or monitoring) to show up in TensorBoard when using skflow.

This is my code:

classifier = skflow.TensorFlowEstimator(
    model_fn=conv_model,
    n_classes=2,
    batch_size=BATCH_SIZE,
    steps=100000,
    learning_rate=0.001,
    config=RunConfig(gpu_memory_fraction=0.9))

val_monitor = monitors.ValidationMonitor(X_val, y_val, n_classes=2, print_steps=100)

classifier.fit(X_train, y_train, val_monitor, logdir='my_model_1/')

classifier.save('my_model_1/')
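
For reference, these are the imports the snippet above assumes; the module paths are from the tf.contrib.learn build I'm on and may differ in other versions:

# Assumed imports for the snippet above; exact paths may vary between builds.
from tensorflow.contrib import learn as skflow
from tensorflow.contrib.learn.python.learn import monitors
from tensorflow.contrib.learn.python.learn.estimators import RunConfig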

Everything runs well:

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/io/data_feeder.py:281: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
  out.itemset((i, self.y[sample]), 1.0)
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:03:00.0
Total memory: 4.00GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:03:00.0)
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/io/data_feeder.py:370: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  out.itemset((i, y), 1.0)
Step #99, avg. train loss: 2.22587, avg. val loss: 2.14521
Step #199, avg. train loss: 0.82641, avg. val loss: 0.89103
Step #299, avg. train loss: 0.78344, avg. val loss: 0.85636
Step #399, avg. train loss: 0.76420, avg. val loss: 0.85675
Step #499, avg. train loss: 0.75868, avg. val loss: 0.84104
Step #599, avg. train loss: 0.75467, avg. val loss: 0.84945
Step #699, avg. train loss: 0.73990, avg. val loss: 0.91238
Step #799, avg. train loss: 0.73400, avg. val loss: 0.92720
Step #899, avg. train loss: 0.72879, avg. val loss: 0.91054
Step #999, avg. train loss: 0.73448, avg. val loss: 0.89823
Step #1099, avg. train loss: 0.70125, avg. val loss: 0.91640
Step #1199, avg. train loss: 0.71879, avg. val loss: 0.90597
Step #1299, avg. train loss: 0.70713, avg. val loss: 0.90736
Step #1399, avg. train loss: 0.70023, avg. val loss: 0.91414
Step #1499, avg. train loss: 0.69566, avg. val loss: 0.91007
Step #1599, avg. train loss: 0.68030, avg. val loss: 0.92729
Step #1699, avg. train loss: 0.68919, avg. val loss: 0.91168
Step #1799, avg. train loss: 0.67088, avg. val loss: 0.91744
Step #1899, avg. train loss: 0.68732, avg. val loss: 0.88844
Step #1999, avg. train loss: 0.67585, avg. val loss: 0.88854

It generates a .tfevents file of 4.8 MB (attached).
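
A quick way to check what the events file actually contains, independent of the TensorBoard UI, is to iterate over it directly (a minimal sketch; the filename is a placeholder and tf.train.summary_iterator is assumed to be available in this build):

import tensorflow as tf

# Print every scalar summary found in the events file; no output at all
# means the Events tab in TensorBoard will stay empty as well.
for event in tf.train.summary_iterator('my_model_1/events.out.tfevents.PLACEHOLDER'):
    for value in event.summary.value:
        if value.HasField('simple_value'):
            print('%d %s %f' % (event.step, value.tag, value.simple_value))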

When I connect to the machine using Chrome as the browser, I see data under Graphs and Histograms, but nothing under Events ("No scalar data was found").

Did I miss something needed to get the loss logged?

NB: I added logging_ops.scalar_summary("model_loss", self._model_loss) in learn/python/learn/estimators/base.py, and model_loss now appears in TensorBoard.
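
An alternative to patching base.py would be to emit the summary from the model function itself. A minimal sketch (the body below is a hypothetical stand-in for my real conv_model, and tf.scalar_summary is the pre-1.0 summary API):

import tensorflow as tf
from tensorflow.contrib import learn as skflow

def conv_model(X, y):
    # hypothetical stand-in for the real network
    prediction, loss = skflow.models.logistic_regression(X, y)
    # write the loss as a scalar summary so it shows up under Events
    tf.scalar_summary('model_loss', loss)
    return prediction, loss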

PS: I'm running on a GPU machine using the latest TensorFlow build; the tfevents file is attached as my_model_1.zip.


1 Answer


It was an issue in skflow, corrected here, and also for monitoring validation here.
