
Recently I have started working with a Tesla T4 GPU, 12 vCPUs, and 60 GB of RAM. I am training a Seq2Seq bidirectional LSTM with an attention layer; the model has 38,863,916 trainable parameters. While training it I am getting the following error: `GPU sync failed`. I searched for the error and learned that it means my GPU memory is full. Following is my code:

from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Concatenate
from tensorflow.keras.models import Model
# AttentionLayer is a custom attention layer (not a built-in Keras layer),
# defined/imported elsewhere in the project.

### Encoder
encoder_inputs = Input(shape=(max_x_len,))

emb1 = Embedding(len(x_voc), 100, weights=[x_voc], trainable=False)(encoder_inputs)

encoder = Bidirectional(LSTM(latent_dim, return_state=True, return_sequences=True))
encoder_outputs0, _, _, _, _ = encoder(emb1)

encoder = Bidirectional(LSTM(latent_dim, return_state=True, return_sequences=True))
encoder_outputs2, forward_h, forward_c, backward_h, backward_c = encoder(encoder_outputs0)

encoder_states = [forward_h, forward_c, backward_h, backward_c]


### Decoder
decoder_inputs = Input(shape=(None,))

emb2 = Embedding(len(y_voc), 100, weights=[y_voc], trainable=False)(decoder_inputs)

decoder_lstm = Bidirectional(LSTM(latent_dim, return_sequences=True, return_state=True))
decoder_outputs2, _, _, _, _ = decoder_lstm(emb2, initial_state=encoder_states)


### Attention over encoder outputs, concatenated with the decoder outputs
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs2, decoder_outputs2])
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs2, attn_out])

decoder_dense = Dense(len(y_voc), activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)


model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit([x_inc, x_dec], y_dec, batch_size=32, epochs=500)

x_inc.shape => (1356, 433)
x_dec.shape => (1356, 131)
y_dec.shape => (1356, 131, 10633)
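
For scale, here is a rough back-of-the-envelope check of the tensor sizes implied by these shapes (a sketch assuming float32 throughout; it ignores activations, gradients, and optimizer state, which add to the per-batch GPU footprint):

params = 38_863_916                   # trainable parameters reported above
print(f"weights (float32): {params * 4 / 1e6:.0f} MB")        # ~155 MB

y_dec_elems = 1356 * 131 * 10633      # one-hot targets, shape (1356, 131, 10633)
print(f"y_dec in host RAM: {y_dec_elems * 4 / 1e9:.1f} GB")    # ~7.6 GB

batch_elems = 32 * 131 * 10633        # target slice for one batch of 32
print(f"targets per batch: {batch_elems * 4 / 1e6:.0f} MB")    # ~178 MB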
asked by hR 312
  • 38MB of training parameters is consuming your GPU's VRAM? Hmm... are any other processes taking your GPU? You can check with `gpustat` or `nvidia-smi`. – Mateen Ulhaq Sep 27 '19 at 09:27
  • Maybe you left another tensorflow instance open. – Mateen Ulhaq Sep 27 '19 at 09:28
  • @MateenUlhaq: OP didn't say 38MB, but 38 millions. So just the input vector for one instance is at least 4*38 Mb; no idea about other layers, or the weight matrices, or if batch learning is used. It may be as you said, or it may be that the graph is actually huge; we can't know from the above. – Amadan Sep 27 '19 at 09:33
  • Kindly let me know what all should I add in question? – hR 312 Sep 27 '19 at 09:35
  • Why a downvote? – hR 312 Sep 27 '19 at 09:46
  • In the TF log you should see the GPU device initialization, which contains also the available GPU memory. Check that you do in fact have enough GPU memory available then (e.g., if another process is using the GPU you won't have the full 60GBs and, especially if the other process is a TF Session, typically it will hoard all the available memory). If that's still the case, lowering the batch size is your only way to go further (while keeping the same model, that is). – GPhilo Sep 27 '19 at 10:48
  • @hR 312, Can you please confirm if the error is resolved with the comment mentioned above? –  Jun 11 '20 at 10:32
  • @hR312, Can you please confirm if your issue is resolved? If your issue is not resolved, try reducing the Batch Size to 16 or 8 or finally to 1. Please let us know how it goes. Thanks! –  Jun 19 '20 at 07:00
  • Yes it works with reduced batch size – hR 312 Jun 23 '20 at 23:25
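
Per the comments above, the error went away once the batch size was reduced. Below is a minimal sketch of that change, assuming TensorFlow 2.x; the batch size of 8 and the optional memory-growth setting are illustrative, not taken from the original post. Checking `nvidia-smi` or `gpustat` for other processes holding GPU memory, as suggested above, is also worth doing first.

import tensorflow as tf

# Optional: let TensorFlow allocate GPU memory on demand instead of
# reserving it all at startup. Must run before the GPU is first used.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# ... build and compile the model exactly as above ...

# Retrain with a smaller batch size (try 16, 8, or even 1, as suggested)
model.fit([x_inc, x_dec], y_dec, batch_size=8, epochs=500)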

0 Answers