
I am trying to run a hyperparameter optimization (using Spearmint) on a large network with many trainable variables. I am worried that when I try a network with too many hidden units, TensorFlow will throw a GPU memory error.

I was wondering if there is a way to catch the GPU memory error thrown by TensorFlow and skip the batch of hyperparameters that caused it.
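The skip-on-error loop I have in mind can be sketched generically. In this sketch, `run_search` and `fake_train` are hypothetical stand-ins (not part of any library), and `MemoryError` stands in for whatever exception the framework raises on OOM; in TensorFlow that would be `tf.errors.ResourceExhaustedError`:

```python
def run_search(hyperparams, train_fn):
    """Try each hyperparameter setting; skip any that exhausts memory.

    train_fn is a hypothetical callable that builds and trains a model and
    may raise MemoryError (stand-in for the framework's OOM exception).
    Returns a list of (params, result) pairs for the runs that succeeded.
    """
    results = []
    for params in hyperparams:
        try:
            results.append((params, train_fn(params)))
        except MemoryError:
            # Skip this configuration and move on to the next one.
            continue
    return results

def fake_train(params):
    # Simulated trainer: pretend settings above 512 hidden units run out
    # of GPU memory, so we can exercise the skip path without a GPU.
    if params["hidden_units"] > 512:
        raise MemoryError("out of GPU memory (simulated)")
    return {"loss": 1.0 / params["hidden_units"]}

search_space = [{"hidden_units": n} for n in (128, 1024, 256)]
ok = run_search(search_space, fake_train)
# The 1024-unit run is skipped; the 128- and 256-unit runs survive.
print([p["hidden_units"] for p, _ in ok])
```

The question is whether TensorFlow's OOM error can be caught this way at all, which is what the code below tries to test.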

For example, I would like something like

import tensorflow as tf 

dim = [100000,100000]
X   = tf.Variable( tf.truncated_normal( dim, stddev=0.1 ) )

with tf.Session() as sess:
    try:
        tf.global_variables_initializer().run()
    except Exception as e:
        print(e)

When I run the code above to test the memory-error exception, it just prints the GPU memory error and aborts; execution never reaches the except block.

unknown_jy
    maybe your version is too old? I just [tried](https://github.com/yaroslavvb/stuff/blob/master/gpu_oom.py) in latest version, and it's caught on python side successfully – Yaroslav Bulatov Jan 30 '17 at 18:12

1 Answer


Try this:

import tensorflow as tf

try:
    with tf.device("gpu:0"):
        a = tf.Variable(tf.ones((10000, 10000)))
        sess = tf.Session()
        # initialize_all_variables() is deprecated; use the replacement.
        sess.run(tf.global_variables_initializer())
except tf.errors.ResourceExhaustedError as e:
    # TensorFlow raises ResourceExhaustedError on GPU OOM, and it is
    # catchable on the Python side like any other exception.
    print("Caught error:", e)
    import pdb; pdb.set_trace()

source : https://github.com/yaroslavvb/stuff/blob/master/gpu_oom.py

H4k333m