
I was trying to implement a sequence-to-sequence language model. During training, the model takes in a sequence of 50-dimensional word vectors generated by GloVe and outputs a 1-by-V vector (where V is the vocabulary size) representing the next word; at test time this output can be regarded as the distribution of the next word given the input word vector at the current timestep. I experimented with a 112-word vocabulary.

Then, I built two models as follows:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

# model1: the LSTM output is used directly as the per-timestep prediction
model1 = Sequential()
model1.add(LSTM(112, return_sequences=True, input_shape=(31, 50)))

# model2: a Dense layer is applied to the LSTM output at every timestep
model2 = Sequential()
model2.add(LSTM(112, return_sequences=True, input_shape=(31, 50)))
model2.add(TimeDistributed(Dense(112, activation="linear")))

When I tried to fit them with

model.fit(X, Y, batch_size=128, nb_epoch=256, validation_split=0.1)

The first model, model1, crashed with a MemoryError, but the second model, model2, finished normally. X has shape (number_of_sentences, max_words_in_one_sentence, 50) and Y has shape (number_of_sentences, max_words_in_one_sentence, 112). In this example, number_of_sentences=10000 and max_words_in_one_sentence=13.

I am wondering what happens when I append a TimeDistributed Dense layer to an LSTM layer, and which of the two models is the right one for my language model.


1 Answer


What happened is that your computing device (probably a GPU) ran out of memory. I suspect it is an NVIDIA card (due to the lack of alternatives), so check the output of nvidia-smi to see whether you are running into memory issues.

Depending on the backend (Theano or TensorFlow) you may see different memory-usage behaviour, so switching backends may be a solution in some cases.
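As a minimal sketch, one way to switch is to set the KERAS_BACKEND environment variable before importing Keras (it overrides the "backend" field in ~/.keras/keras.json):

import os

# Must be set before the first `import keras`
os.environ["KERAS_BACKEND"] = "tensorflow"  # or "theano"

import keras  # prints e.g. "Using TensorFlow backend." on import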

If you are using Theano, the issue might be the TimeDistributed wrapper. When no batch size is specified, TimeDistributed does this:

K.reshape(x, (-1,) + input_shape[2:])

so it basically reshapes x from (batch_size, timesteps, units) to (batch_size * timesteps, units). However, reshape currently allocates a new array if the array being reshaped is not C-contiguous (i.e. its data is laid out in memory the way an n-dimensional C array would be), which I suspect is not the case here.
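A rough NumPy analogue of that reshape, using the question's dimensions for illustration (Theano's behaviour differs in detail, but the contiguity point is the same):

import numpy as np

x = np.zeros((128, 31, 112))             # (batch_size, timesteps, units)
flat = x.reshape((-1,) + x.shape[2:])    # a view: (128 * 31, 112) = (3968, 112)

# A non-C-contiguous array (e.g. after a transpose) forces reshape to copy,
# temporarily doubling the memory needed for this tensor:
xt = x.transpose(1, 0, 2)                # (31, 128, 112), no longer C-contiguous
print(xt.flags['C_CONTIGUOUS'])          # False
flat2 = xt.reshape((-1, 112))            # NumPy allocates a new array here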

What you can try is to specify a fixed batch size for your input; in that case TimeDistributed will iterate over the timesteps using K.rnn, without allocating as much memory.
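A minimal sketch of that, assuming the question's shapes and a batch size of 128 (the loss and optimizer are placeholders; X and Y are the arrays from the question):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

# Fixing the batch size up front (batch_input_shape instead of input_shape)
# lets TimeDistributed fall back to the step-by-step K.rnn implementation
# instead of the large reshape. fit() must then use the same batch_size.
model = Sequential()
model.add(LSTM(112, return_sequences=True, batch_input_shape=(128, 31, 50)))
model.add(TimeDistributed(Dense(112, activation="linear")))
model.compile(loss="mse", optimizer="rmsprop")  # placeholder loss/optimizer
model.fit(X, Y, batch_size=128, nb_epoch=256)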
