
This Keras model seems to require 6GB+ of RAM with the TensorFlow backend. My back-of-the-envelope math suggests that storing the weights shouldn't require more than 500MB. What's going on?

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D

IMAGE_SIZE = 128
print('Build model...')
model = Sequential()
# three color channels, 128x128
# 16 conv filters, 3 rows, 3 columns
model.add(Convolution2D(16, 3, 3, input_shape=(3, IMAGE_SIZE, IMAGE_SIZE)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(1))
model.add(Dense(3 * IMAGE_SIZE * IMAGE_SIZE))


model.compile(loss='mse', optimizer='sgd')

It's a convolution layer (16 3x3 filters) whose output is flattened and connected to a single neuron, and then that single neuron is connected to ~50k output neurons.
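
As a rough sketch of the weight count (assuming the channels-first layout implied by `input_shape=(3, IMAGE_SIZE, IMAGE_SIZE)` and the default 'valid' border mode, so the conv output is 16 maps of 126x126):

IMAGE_SIZE = 128

# Convolution2D(16, 3, 3): 16 filters over 3 input channels, 3x3 kernels, plus 16 biases
conv_params = 16 * 3 * 3 * 3 + 16                                # 448

# a 'valid' 3x3 conv over a 128x128 image gives 126x126 feature maps
flat_size = 16 * 126 * 126                                       # 254,016 activations

# Flatten -> Dense(1): one weight per flattened activation, plus one bias
dense1_params = flat_size * 1 + 1                                # 254,017

# Dense(1) -> Dense(3 * 128 * 128): one weight and one bias per output neuron
dense2_params = 1 * (3 * IMAGE_SIZE ** 2) + 3 * IMAGE_SIZE ** 2  # 98,304

total = conv_params + dense1_params + dense2_params              # ~353k parameters
print('%d parameters, about %.1f MB at float32' % (total, total * 4 / 1e6))

However I count it, the weights themselves come out to a few MB at most, nowhere near 6GB.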

I'm pretty new to Keras, so I imagine my misunderstanding is pretty fundamental, but I can't seem to figure it out.

  • Have you tried another backend to narrow down the possibility of bugs? – cfh Mar 02 '16 at 20:18
  • Yeah, I just installed Theano (no tweaking or anything), and the memory usage seems to be just as high. Is it possibly something inside of Keras that's leaking before the backend is even used? I don't know enough about the code... – Ryan Marcus Mar 02 '16 at 20:19
  • Obviously, you need a more accurate envelope. What basis do you have for your 500MB expectation? – msw Mar 02 '16 at 20:20
  • Haha, it sounds like you might have a better understanding of what's going on than I do. Maybe you could share your calculations? Perhaps my misunderstanding is of the `num_filters` (first) parameter of the conv net. My understanding is that there would be `O(filters * pixels)` weights going from the conv net to the single neuron, then `O(pixels)` weights going to the dense layer. – Ryan Marcus Mar 02 '16 at 20:24
  • Nopey, I've got no idea what's happening; sorry if my joke implied otherwise. I presume you know that Big-O notation is a proportionality function which ignores constant factors. If you told me that my estimates of the memory usage of an unfamiliar, complex Python package were off by a factor of 12, I'd be wholly unsurprised. – msw Mar 03 '16 at 04:20
  • Do you use floats or doubles? And try to use [model.save](http://keras.io/faq/#how-can-i-save-a-keras-model) to see how much space the model needs when serialized to disk. – cfh Mar 03 '16 at 07:53
  • Or even better, follow [this thread](https://github.com/fchollet/keras/issues/91) and call `layer.get_weights()` on each layer to see exactly what size the weight matrix for each layer is. – cfh Mar 03 '16 at 07:58
  • @msw yeah, maybe that is just the amount of overhead needed... – Ryan Marcus Mar 03 '16 at 17:10
  • @cfh my computer is running out of memory before the model can be compiled, so I cannot call either of those methods. – Ryan Marcus Mar 03 '16 at 17:10
  • Try scaling your model back, say `IMAGE_SIZE = 64`, and try then. – cfh Mar 03 '16 at 17:40
  • Note: the memory consumed by the weights should be 458KB. You can check via `num_bytes = sum(np.prod(x.shape) for x in model.weights) * 4` – Mateen Ulhaq Mar 26 '20 at 09:28
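
Following the suggestions in the comments above, here is a quick sketch for inspecting the per-layer weight shapes and totalling their size for the model from the question (it assumes the model compiles, so you may need the scaled-down `IMAGE_SIZE = 64` variant first, and `model.weights` is only available in reasonably recent Keras versions):

import numpy as np

# print each layer's weight array shapes (per cfh's suggestion)
for layer in model.layers:
    print(layer.name, [w.shape for w in layer.get_weights()])

# total the bytes occupied by the weights, assuming float32 (4 bytes each)
num_bytes = sum(np.prod(w.shape) for w in model.weights) * 4
print('weights take roughly %.2f MB' % (num_bytes / 1e6))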

1 Answer


Turns out my issue was having a path to CUDA 7.5 in my LD_CONFIG_PATH but a path to CUDA 7.0 in PATH. Apparently this awkward combination leads to undefined behavior, which in my case produced a memory leak.

After examining it with valgrind, I found that the nvcc from 7.0 was essentially jumping into nonsense areas of the CUDA 7.5 library, which is not unexpected. It's actually pretty amazing that it leaked memory instead of just crashing, and that Theano hit the same error.

Hopefully no one else will be plagued by this particular issue in the future, but if you are, double-check your version paths!
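
If you want to sanity-check this quickly, something along these lines works (just a sketch; which environment variables matter depends on your setup, and it assumes `nvcc` is on your PATH):

import os
import subprocess

# print the lookup paths that decide which CUDA toolkit gets picked up
for var in ('PATH', 'LD_LIBRARY_PATH', 'LD_CONFIG_PATH'):
    print(var, '=', os.environ.get(var, '<unset>'))

# nvcc --version reports which toolkit the compiler on PATH actually belongs to
print(subprocess.check_output(['nvcc', '--version']).decode())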

On my local machine, without a GPU-enabled TensorFlow installed, I still got the memory leak, which appeared to be a bug in the previous TensorFlow release (0.7.0) that has been resolved in the 0.7.1 release. Again, I haven't figured out why my non-GPU Theano backend also produced the leak, but after upgrading TensorFlow, the Theano backend doesn't leak either. It's a very strange thing, but I believe the general solution to this problem is "upgrade" and "double-check your env".
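
A quick way to confirm which versions actually ended up installed after upgrading:

import keras
import tensorflow as tf

# the leak described above went away for me once I was past TensorFlow 0.7.0
print('tensorflow', tf.__version__)
print('keras', keras.__version__)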
