2

I'm using AWS p2.x8large and trying to evaluate my model using k-fold cross validation. After the first repetition my GPUs memory is full and when I try to train once again I receive a cuda memory problem.

My question is how to reset the GPU memory within the loop? I used K.clear_session() and also gc.collect() but none of them worked.

the error message:

> MXNetError                                Traceback (most recent call
> last) ~/anaconda3/lib/python3.6/site-packages/mxnet/symbol.py in
> simple_bind(self, ctx, grad_req, type_dict, group2ctx,
> shared_arg_names, shared_exec, shared_buffer, **kwargs)    1472       
> shared_exec_handle,
> -> 1473                                                  ctypes.byref(exe_handle)))    1474         except MXNetError as e:
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/base.py in
> check_call(ret)
>     128     if ret != 0:
> --> 129         raise MXNetError(py_str(_LIB.MXGetLastError()))
>     130 
> 
> MXNetError: [19:24:04] src/storage/./pooled_storage_manager.h:102:
> cudaMalloc failed: out of memory
> 
> Stack trace returned 10 entries: [bt] (0)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1d57cc)
> [0x7f55ce9fe7cc] [bt] (1)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1242238)
> [0x7f55cfa6b238] [bt] (2)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1244c0a)
> [0x7f55cfa6dc0a] [bt] (3)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe4d4db)
> [0x7f55cf6764db] [bt] (4)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe549cd)
> [0x7f55cf67d9cd] [bt] (5)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe59f95)
> [0x7f55cf682f95] [bt] (6)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe5d6ee)
> [0x7f55cf6866ee] [bt] (7)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe5dcd4)
> [0x7f55cf686cd4] [bt] (8)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBind+0x2261)
> [0x7f55cf605291] [bt] (9)
> /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)
> [0x7f560d6c4ec0]
> 
> 
> During handling of the above exception, another exception occurred:
> 
> RuntimeError                              Traceback (most recent call
> last) <ipython-input-4-0720b69f15af> in <module>()
>      33 if val_batches.n>0:
>      34     hist = model.fit_generator(generator=train_gen, samples_per_epoch=batches.n,
> ---> 35         nb_epoch=epochs, verbose=True, validation_data=val_gen, nb_val_samples=val_batches.n,
> callbacks=callbacks)
>      36 else:
>      37     model.fit_generator(generator=train_gen, samples_per_epoch=batches.n,
> 
> ~/anaconda3/lib/python3.6/site-packages/Keras-1.2.2-py3.6.egg/keras/engine/training.py
> in fit_generator(self, generator, samples_per_epoch, nb_epoch,
> verbose, callbacks, validation_data, nb_val_samples, class_weight,
> max_q_size, nb_worker, pickle_safe, initial_epoch)    1557            
> outs = self.train_on_batch(x, y,    1558                              
> sample_weight=sample_weight,
> -> 1559                                                class_weight=class_weight)    1560     1561                     if not
> isinstance(outs, list):
> 
> ~/anaconda3/lib/python3.6/site-packages/Keras-1.2.2-py3.6.egg/keras/engine/training.py
> in train_on_batch(self, x, y, sample_weight, class_weight)    1320    
> ins = x + y + sample_weights    1321        
> self._make_train_function()
> -> 1322         outputs = self.train_function(ins)    1323         if len(outputs) == 1:    1324             return outputs[0]
> 
> ~/anaconda3/lib/python3.6/site-packages/Keras-1.2.2-py3.6.egg/keras/engine/training.py
> in train_function(inputs)    1952         def
> _make_train_function(self):    1953             def train_function(inputs):
> -> 1954                 data, label, _, data_shapes, label_shapes = self._adjust_module(inputs, 'train')    1955     1956                
> batch = K.mx.io.DataBatch(data=data, label=label, bucket_key='train',
> 
> ~/anaconda3/lib/python3.6/site-packages/Keras-1.2.2-py3.6.egg/keras/engine/training.py
> in _adjust_module(self, inputs, phase)    1908             if not
> self._mod.binded:    1909                
> self._mod.bind(data_shapes=data_shapes, label_shapes=None,
> -> 1910                                for_training=True)    1911                 self._set_weights()    1912                
> self._mod.init_optimizer(kvstore=self._kvstore,
> optimizer=self.optimizer)
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/module/bucketing_module.py
> in bind(self, data_shapes, label_shapes, for_training,
> inputs_need_grad, force_rebind, shared_module, grad_req)
>     322                         state_names=self._state_names)
>     323         module.bind(data_shapes, label_shapes, for_training, inputs_need_grad,
> --> 324                     force_rebind=False, shared_module=None, grad_req=grad_req)
>     325         self._curr_module = module
>     326         self._curr_bucket_key = self._default_bucket_key
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/module/module.py in
> bind(self, data_shapes, label_shapes, for_training, inputs_need_grad,
> force_rebind, shared_module, grad_req)
>     415                                                      fixed_param_names=self._fixed_param_names,
>     416                                                      grad_req=grad_req,
> --> 417                                                      state_names=self._state_names)
>     418         self._total_exec_bytes = self._exec_group._total_exec_bytes
>     419         if shared_module is not None:
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/module/executor_group.py
> in __init__(self, symbol, contexts, workload, data_shapes,
> label_shapes, param_names, for_training, inputs_need_grad,
> shared_group, logger, fixed_param_names, grad_req, state_names)
>     229         self.num_outputs = len(self.symbol.list_outputs())
>     230 
> --> 231         self.bind_exec(data_shapes, label_shapes, shared_group)
>     232 
>     233     def decide_slices(self, data_shapes):
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/module/executor_group.py
> in bind_exec(self, data_shapes, label_shapes, shared_group, reshape)
>     325             else:
>     326                 self.execs.append(self._bind_ith_exec(i, data_shapes_i, label_shapes_i,
> --> 327                                                       shared_group))
>     328 
>     329         self.data_shapes = data_shapes
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/module/executor_group.py
> in _bind_ith_exec(self, i, data_shapes, label_shapes, shared_group)
>     601                                            type_dict=input_types, shared_arg_names=self.param_names,
>     602                                            shared_exec=shared_exec,
> --> 603                                            shared_buffer=shared_data_arrays, **input_shapes)
>     604         self._total_exec_bytes += int(executor.debug_str().split('\n')[-3].split()[1])
>     605         return executor
> 
> ~/anaconda3/lib/python3.6/site-packages/mxnet/symbol.py in
> simple_bind(self, ctx, grad_req, type_dict, group2ctx,
> shared_arg_names, shared_exec, shared_buffer, **kwargs)    1477       
> error_msg += "%s: %s\n" % (k, v)    1478             error_msg += "%s"
> % e
> -> 1479             raise RuntimeError(error_msg)    1480     1481         # update shared_buffer
> 
> RuntimeError: simple_bind error. Arguments: input_1_1: (64, 3, 224,
> 224) [19:24:04] src/storage/./pooled_storage_manager.h:102: cudaMalloc
> failed: out of memory
> 
> Stack trace returned 10 entries: [bt] (0)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1d57cc)
> [0x7f55ce9fe7cc] [bt] (1)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1242238)
> [0x7f55cfa6b238] [bt] (2)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x1244c0a)
> [0x7f55cfa6dc0a] [bt] (3)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe4d4db)
> [0x7f55cf6764db] [bt] (4)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe549cd)
> [0x7f55cf67d9cd] [bt] (5)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe59f95)
> [0x7f55cf682f95] [bt] (6)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe5d6ee)
> [0x7f55cf6866ee] [bt] (7)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xe5dcd4)
> [0x7f55cf686cd4] [bt] (8)
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBind+0x2261)
> [0x7f55cf605291] [bt] (9)
> /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)
> [0x7f560d6c4ec0]

enter image description here

talonmies
  • 70,661
  • 34
  • 192
  • 269
Jenia Golbstein
  • 374
  • 2
  • 12

1 Answers1

0

Using gc.collect I was able to limit the GPU memory footprint twice that of a single run. Without it, the memory footprint kept on increasing. I created a function for the training and evaluation, which allowed the gc to clear up the model after the result was returned.

cv_results = []
for train, test in cv_folds:
    result = train_and_eval(train, test)
    cv_results.append(result)
    gc.collect()

Since the footprint is still double, you should look into reducing the batch size to compensate for this. You should then be able to fit everything into GPU memory.

It's worth noting that MXNet doesn't actually deallocate memory from the GPU, it adds it back into its internal memory pool for future use. So although the GPU memory usage may still look high in nvidia-smi, the memory is still free to use for MXNet. As with most computation steps, the garbage collection of this memory happens asynchronously.

If you can't keep two runs worth of memory on your GPU, you can always start subprocesses as mentioned by geoalgo, using something similar to;

from subprocess import Popen, PIPE, STDOUT
import json

def eval_on_fold(indicies):
    indicies_str = json.dumps(indicies)
    p = Popen(['python', 'train_and_eval.py', '--cv-indicies'], stdout=PIPE, stdin=PIPE, stderr=PIPE)
    eval_metric_str = p.communicate(input=indicies_str)[0]
    eval_metric = float(eval_metric_str)
    return eval_metric

cv_results = []
for train, test in cv_folds:
    indicies = {'train': train, 'test': test}
    eval_metric = eval_on_fold(indicies)
    cv_results.append(eval_metric)
MBT
  • 21,733
  • 19
  • 84
  • 102
Thom Lane
  • 993
  • 9
  • 9