I want to train a language model in Keras, following this tutorial: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
My input is composed of:
lines: 4823744
maximum line length: 20
Vocabulary Size: 790609
Total Sequences: 2172328
Max Sequence Length: 11
As you can see from these lines:
from keras.preprocessing.text import Tokenizer

num_words = 50
tokenizer = Tokenizer(num_words=num_words, lower=True)
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
I'm passing num_words=50 to the tokenizer, but vocab_size is taken from tokenizer.word_index, which still holds the full vocabulary, so it comes out as the full size (790K).
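If I understand the Keras Tokenizer correctly, num_words only filters the indices that texts_to_sequences returns; word_index itself is never truncated. So one fix I'm considering is capping vocab_size myself (a sketch based on the code above):

# num_words keeps only the top indices in texts_to_sequences output,
# but word_index still holds all ~790K words, so cap vocab_size by hand
vocab_size = min(num_words, len(tokenizer.word_index) + 1)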
Therefore this line:
y = to_categorical(y, num_classes=vocab_size)
causes a memory error: one-hot encoding ~2.17M targets over ~790K classes is far too large to fit in memory.
This is the model definition:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
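One workaround that might sidestep the one-hot blow-up entirely (a sketch, assuming the X and y arrays from the tutorial, with the to_categorical call removed) is to keep y as integer word indices and use a sparse loss:

# keep y as integer class indices; sparse_categorical_crossentropy
# computes the same loss without materializing one-hot vectors
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# training settings here are illustrative, not the tutorial's exact values
model.fit(X, y, epochs=100, verbose=2)

I'm not sure whether this is fully equivalent for the tutorial's setup, though.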
I do want a word-level model rather than a char-level one, and I want to keep at least the 10K most common words.
I thought about filtering rare words beforehand, but dropping a word splices its neighbours together, so the language model could learn false sequences that never occur in the data.
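The alternative I can think of is to replace rare words with a placeholder instead of deleting them, so the surrounding word order stays intact. A rough sketch (the <unk> token and the lines variable are my own, not from the tutorial):

from collections import Counter

# count word frequencies over the raw lines (hypothetical `lines` list of strings)
counts = Counter(word for line in lines for word in line.split())
keep = {word for word, _ in counts.most_common(10000)}  # top 10K words

# substitute '<unk>' for every rare word instead of dropping it,
# so neighbouring words are not spliced into sequences that never occurred
cleaned = [' '.join(word if word in keep else '<unk>' for word in line.split())
           for line in lines]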
Is that a sensible approach, or how else can I solve this?
Thanks