This is what François Chollet, creator of the Keras library and a core contributor to the TensorFlow framework, says about RNN runtime performance in his book Deep Learning with Python, 2nd edition:
Recurrent models with very few parameters, like the ones in this chapter, tend to be significantly faster on a multicore CPU than on GPU, because they only involve small matrix multiplications, and the chain of multiplications is not well parallelizable due to the presence of a for loop. But larger RNNs can greatly benefit from a GPU runtime.
When using a Keras LSTM or GRU layer on GPU with default keyword arguments, your layer will be leveraging a cuDNN kernel, a highly optimized, low-level, NVIDIA-provided implementation of the underlying algorithm. As usual, cuDNN kernels are a mixed blessing: they’re fast, but inflexible—if you try to do anything not supported by the default kernel, you will suffer a dramatic slowdown, which more or less forces you to stick to what NVIDIA happens to provide. For instance, recurrent dropout isn’t supported by the LSTM and GRU cuDNN kernels, so adding it to your layers forces the runtime to fall back to the regular TensorFlow implementation, which is generally two to five times slower on GPU (even though its computational cost is the same).
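To make that concrete, here is a minimal sketch (not from the book; the layer sizes and input shape are made-up values) showing one LSTM that is eligible for the cuDNN kernel on GPU and one that forces the fallback:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical input: 120 timesteps, 14 features per step.
inputs = keras.Input(shape=(120, 14))

# Default keyword arguments: on a GPU, this layer can use the fused cuDNN kernel.
cudnn_eligible = layers.LSTM(32)(inputs)

# recurrent_dropout is not supported by the cuDNN kernel, so this layer falls
# back to the generic TensorFlow implementation (slower on GPU, same math).
fallback = layers.LSTM(32, recurrent_dropout=0.25)(inputs)

Changing activation or recurrent_activation away from their defaults, or setting unroll=True, breaks cuDNN eligibility in the same way.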
As a way to speed up your RNN layer when you can’t use cuDNN, you can try unrolling it. Unrolling a for loop consists of removing the loop and simply inlining its content N times. In the case of the for loop of an RNN, unrolling can help TensorFlow optimize the underlying computation graph. However, it will also considerably increase the memory consumption of your RNN—as such, it’s only viable for relatively small sequences (around 100 steps or fewer). Also, note that you can only do this if the number of timesteps in the data is known in advance by the model (that is to say, if you pass a shape without any None entries to your initial Input()). It works like this:
inputs = keras.Input(shape=(sequence_length, num_features))
x = layers.LSTM(32, recurrent_dropout=0.2, unroll=True)(inputs)
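Those two lines come straight from the book and assume keras and layers are already imported and that sequence_length and num_features were defined earlier in the chapter. For reference, here is a self-contained sketch with placeholder values; the shapes, the Dense head, and the training call are illustrative assumptions, not the book's code:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

sequence_length = 100  # placeholder; unrolling is only viable for short sequences
num_features = 14      # placeholder

inputs = keras.Input(shape=(sequence_length, num_features))
# unroll=True inlines the 100 recurrent steps into the graph instead of using a
# symbolic loop, trading extra memory for a potentially faster compiled graph.
x = layers.LSTM(32, recurrent_dropout=0.2, unroll=True)(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse")

# Quick check on random data to confirm the unrolled model builds and trains.
model.fit(np.random.rand(64, sequence_length, num_features),
          np.random.rand(64, 1), epochs=1, verbose=0)

Note that unrolling requires the number of timesteps to be fixed, which is why sequence_length is passed explicitly to Input() rather than left as None.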