
I'm trying to run a simple sequential TensorFlow model through the Keras wrapper on an AMD GPU (AMD Vega 20, TensorFlow 2.2.0, Keras 2.4.3), but I'm hitting a strange error when calling fit:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 15 values, but the requested shape has 15976860750

It seems to be taking the batch size as the number of values for the input tensor, and somehow the size of the "requested shape" explodes. The model definition is as follows:

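# Imports inferred from the traceback paths (tensorflow.python.keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def create_model(optimizer='rmsprop', init='glorot_uniform'):
    # Create model: 8 inputs -> 12 -> 8 -> 1 (sigmoid for binary classification)
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
    model.add(Dense(8, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

modelClf = KerasClassifier(build_fn=create_model, verbose=1, batch_size=15, epochs=9)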

The exact same model works fine on CPU only, on a machine that has no GPU installed. It also works well with the CUDA11 implementation on another machine running an NVIDIA GPU (with TensorFlow 1.15.3 and Keras 2.3.1).

I have no idea why it would be requesting the GPU memory size as the input size on this later TensorFlow version, and only when an AMD GPU is present. Is there something obvious that I might be getting wrong with the configuration here?

EDIT: In response to the comments below: after some experimenting, the "requested shape" size turns out to be related to the batch size, not to the GPU memory as I first thought. The earlier number was apparently a coincidence; setting the batch size to 10 gives a "requested shape" of 1092616192 instead. The input is just a simple pandas DataFrame with 8 values in each row (matching input_dim), and as mentioned, the same implementation works fine on other machines.
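One observation that may help narrow this down (a small sketch, not a diagnosis): the value 1092616192 reported with a batch size of 10 is 0x41200000 in hex, which is the IEEE 754 float32 bit pattern for 10.0. The helper name `int_bits_as_float32` below is just an illustrative name, and the check uses only the Python standard library. If this is more than a coincidence, it might suggest that a float value is being read as an integer shape somewhere, but that part is speculation.

```python
import struct

def int_bits_as_float32(n: int) -> float:
    """Reinterpret the bit pattern of a 32-bit unsigned integer as a float32."""
    return struct.unpack("<f", struct.pack("<I", n))[0]

requested = 1092616192  # "requested shape" size reported with batch_size=10

print(hex(requested))                  # 0x41200000
print(int_bits_as_float32(requested))  # 10.0
```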

The error occurs during the call to fit() for training. I can see from the output that it gets about 5 epochs in before crashing like this. The traceback is below ("~/rocm/keras" is just the path where I have the Python packages installed for this environment):

    File "~/rocm/keras/tensorflow/python/keras/wrappers/scikit_learn.py", line 223, in fit
      return super(KerasClassifier, self).fit(x, y, **kwargs)
    File "~/rocm/keras/tensorflow/python/keras/wrappers/scikit_learn.py", line 166, in fit
      history = self.model.fit(x, y, **fit_args)
    File "~/rocm/keras/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
      return method(self, *args, **kwargs)
    File "~/rocm/keras/tensorflow/python/keras/engine/training.py", line 848, in fit
      tmp_logs = train_function(iterator)
    File "~/rocm/keras/tensorflow/python/eager/def_function.py", line 580, in __call__
      result = self._call(*args, **kwds)
    File "~/rocm/keras/tensorflow/python/eager/def_function.py", line 611, in _call
      return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    File "~/rocm/keras/tensorflow/python/eager/function.py", line 2420, in __call__
      return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
    File "~/rocm/keras/tensorflow/python/eager/function.py", line 1665, in _filtered_call
      self.captured_inputs)
    File "~/rocm/keras/tensorflow/python/eager/function.py", line 1746, in _call_flat
      ctx, args, cancellation_manager=cancellation_manager))
    File "~/rocm/keras/tensorflow/python/eager/function.py", line 598, in call
      ctx=ctx)
    File "~/rocm/keras/tensorflow/python/eager/execute.py", line 60, in quick_execute
      inputs, attrs, num_outputs)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 10 values, but the requested shape has 1092616192
Jez W
  • What is the shape of your input data, and which line is throwing the error? It would be good if you could provide the stack trace or a minimal reproducible example. – Ashwin Geet D'Sa Aug 11 '20 at 09:43
  • @Dr.Snoopy The number 15 is not what I'm assuming is the size of VRAM here, the "15976860750" is (the GPU reports as having "deviceMemorySize: 15.98GiB" so this seems the only reasonable source for a number just under 16e9) – Jez W Aug 11 '20 at 10:30
  • This sounds more like a bug in TensorFlow ROCm than something in Keras or TensorFlow itself. – Dr. Snoopy Aug 11 '20 at 10:32
  • @AshwinGeetD'Sa I've updated the post with the stack trace and the other details - it's crashing with this about 5 epochs into the fit() . The input is just a simple pandas array, with 8 inputs on each row and ~720k entries. – Jez W Aug 11 '20 at 12:57
