
Consider the following two-line Python/TensorFlow interactive session:

import tensorflow as tf
s=tf.Session()

If these commands are executed on an Ubuntu Linux 14.04 machine with 32G of physical memory and 2 GPUs (a GTX Titan X and a GTX 970), using Anaconda Python 2.7.13 and TensorFlow r1.3 (compiled from source), and CUDA_VISIBLE_DEVICES is not set (i.e. both GPUs are visible), the resulting python process has 59.7G of memory allocated! Note that it only actually uses 754M.

If CUDA_VISIBLE_DEVICES=0 (i.e. only the Titan X is visible) then 55.2G is allocated and 137M is in use.

If CUDA_VISIBLE_DEVICES=1 (i.e. only the 970 is visible) then 47.0G is allocated and 325M is in use.

If CUDA_VISIBLE_DEVICES= (i.e. neither GPU is visible) then only 2.5G is allocated and only 131M is in use.
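
For reference, a minimal sketch (an assumption of mine, not part of the original trials: it assumes Linux and reads /proc/self/status) that reproduces one trial and prints the process's virtual and resident sizes programmatically rather than via htop:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # e.g. only the Titan X visible

import tensorflow as tf
s = tf.Session()

# VmSize corresponds to htop's VIRT column, VmRSS to RES.
with open("/proc/self/status") as f:
    for line in f:
        if line.startswith(("VmSize", "VmRSS")):
            print(line.strip())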

This is a problem in environments where the amount of allocated memory is constrained, e.g. inside a grid engine setup.
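
To make the constraint concrete, here is a sketch (my own illustration, with an arbitrary 8 GiB cap) that emulates a grid-engine-style ulimit -v from Python using the resource module; with such a cap in place, the CUDA driver's large virtual reservation is expected to make session creation fail or fall back to CPU:

import resource

# Emulate a grid-engine-style virtual memory cap (ulimit -v) of 8 GiB.
cap = 8 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

import tensorflow as tf
s = tf.Session()  # expected to fail (or fall back to CPU) because the
                  # CUDA driver's virtual reservation exceeds the 8 GiB cap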

Is there any way to limit the amount of main memory that TensorFlow allocates when it is using CUDA?

Update 1

In these trials, the amount of memory allocated is determined by looking at the VIRT column in htop.

TensorFlow r1.3 is compiled with mostly default configure answers. The only variations are the paths to CUDA and cuDNN. As a result, jemalloc is being used.

Update 2

I've tried recompiling with jemalloc disabled and see the same behaviour.

Daniel Renshaw
  • May be caused by the CUDA driver issue discussed here: https://stackoverflow.com/questions/11631191/why-does-the-cuda-runtime-reserve-80-gib-virtual-memory-upon-initialization – Daniel Renshaw Sep 08 '17 at 11:13
  • How do you check how much memory is allocated? Also could try running with different allocator -- `sudo apt-get install google-perftools; export LD_PRELOAD="/usr/lib/libtcmalloc.so.4" ` – Yaroslav Bulatov Sep 09 '17 at 19:08
  • Thanks @YaroslavBulatov. I tried using `tcmalloc` but it seemed to make no difference to the behaviour. I've heard of similar behaviour when CUDA is used directly instead of via TensorFlow, so I think the best TF could do is, perhaps, configure the CUDA driver differently when it is loaded. – Daniel Renshaw Sep 11 '17 at 15:08
  • Ah, you are looking at virtual memory. I think that's fairly typical on Unix, I often see huge "virt" allocations by processes without any practical downsides – Yaroslav Bulatov Sep 11 '17 at 15:52
  • Unfortunately it has the downside that TF+CUDA process can't run inside execution containers that monitor the virtual memory allocation to determine whether the process is behaving itself and staying within specified constraints, e.g. via `ulimit`. Open Grid Scheduler/Grid Engine is an example of this. – Daniel Renshaw Sep 11 '17 at 16:41
  • hm, I wonder if it's the problem of TensorFlow, or the problem of the malloc implementation. One way is perhaps to add some LOG(INFO) [here](https://github.com/tensorflow/tensorflow/blob/2d8da1d9bd4aaf159b65d5b3d567e79fd41ace23/tensorflow/core/platform/posix/port.cc#L98) to see how much TF is actually requesting from malloc. If it's actually trying to malloc 50GB in your case, that's a bug – Yaroslav Bulatov Sep 11 '17 at 16:48
  • I'm pretty sure the memory isn't being malloc'ed by TF. It looks like it's malloc'ed by the CUDA driver (within the python/TF process) for its unified memory feature. So I think the only thing TF could do is configure the driver differently, possibly by disabling unified memory, which may not be feasible. – Daniel Renshaw Sep 12 '17 at 07:55
  • Maybe try setting smaller value for env var TF_CUDA_HOST_MEM_LIMIT_IN_MB – Yaroslav Bulatov Sep 12 '17 at 15:38
  • Filed issue -- https://github.com/tensorflow/tensorflow/issues/13020 – Yaroslav Bulatov Sep 13 '17 at 20:25
  • Thanks. I tried `TF_CUDA_HOST_MEM_LIMIT_IN_MB` and also didn't see a change in behaviour, though I want to do a more thorough investigation before updating further. – Daniel Renshaw Sep 13 '17 at 21:18

1 Answer


The default behavior of TensorFlow on GPU is to use all of the available memory. However, if you want to avoid this behavior, you can tell the session to allocate the memory dynamically.

From the ConfigProto declaration:

// allow_growth
// If true, the allocator does not pre-allocate the entire specified
// GPU memory region, instead starting small and growing as needed.

In order to do this, pass a ConfigProto object to your session when creating it:

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
sess = tf.Session(config=session_config)

If you want to limit the amount of memory actually used, that depends mainly on your batch size and the number of parameters in your model.
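
If instead you want a hard cap on the GPU memory TensorFlow reserves, ConfigProto also exposes per_process_gpu_memory_fraction; a short sketch (note that, like allow_growth, this bounds GPU memory only, not the host virtual memory the question asks about):

import tensorflow as tf

# Sketch: let TensorFlow reserve at most ~40% of each visible GPU's memory.
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=session_config)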

Lescurel
  • This affects the amount of GPU memory used by TF but my question is about the use of main system memory by TF when CUDA is enabled. – Daniel Renshaw Sep 08 '17 at 15:41
  • @DanielRenshaw I faced the same issue as well; did you find anything to work around it? – Oliver Hu May 15 '18 at 21:18
  • @OliverHu No. Unless running in a memory controlled environment, such as SGE, it doesn't cause much of a problem since the amount of memory actually used is still low enough for me. I've just had to avoid using SGE for TF work. – Daniel Renshaw May 16 '18 at 06:25
  • @DanielRenshaw Right, for single node training it works fine. But under a multi-tenancy environment, we have like 8 jobs running on each node and each job takes like 300 GB of virtual memory for us, so the machine is blown away. :( – Oliver Hu May 16 '18 at 20:22
  • Have we found a solution to this? A tf Session uses an insane amount of main system memory. Can such a thing be filed in the TF issues? @OliverHu – sagar_acharya Jun 04 '19 at 10:46
  • Nope, we turned off virtual memory limit in our cluster. – Oliver Hu Jun 05 '19 at 15:59