I want to train a language model in Keras, following this tutorial: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
My input is composed of:
lines: 4823744
maximum line length: 20
Vocabulary Size: 790609
Total Sequences: 2172328
Max Sequence Length: 11
As you can see from these lines:
from keras.preprocessing.text import Tokenizer

num_words = 50
tokenizer = Tokenizer(num_words=num_words, lower=True)
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
I'm passing num_words=50 to the tokenizer, but vocab_size is taken from tokenizer.word_index, which still holds the full vocabulary, so it comes out as the full size (790K).
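If I understand the Keras Tokenizer correctly, num_words only filters the indices that texts_to_sequences returns; word_index itself is never truncated. So one fix I'm considering is capping vocab_size myself (a sketch based on the code above):

# num_words keeps only the top indices in texts_to_sequences output,
# but word_index still holds all ~790K words, so cap vocab_size by hand
vocab_size = min(num_words, len(tokenizer.word_index) + 1)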
Therefore this line:
y = to_categorical(y, num_classes=vocab_size)
causes a memory error: one-hot encoding ~2.17M targets over ~790K classes is far too large to fit in memory.
This is the model definition:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
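One workaround that might sidestep the one-hot blow-up entirely (a sketch, assuming the X and y arrays from the tutorial, with the to_categorical call removed) is to keep y as integer word indices and use a sparse loss:

# keep y as integer class indices; sparse_categorical_crossentropy
# computes the same loss without materializing one-hot vectors
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# training settings here are illustrative, not the tutorial's exact values
model.fit(X, y, epochs=100, verbose=2)

I'm not sure whether this is fully equivalent for the tutorial's setup, though.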
I do want a word-level model rather than a char-level one, and I want to keep at least the 10K most common words.
I thought about filtering rare words beforehand, but dropping a word splices its neighbours together, so the language model could learn false sequences that never occur in the data.
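The alternative I can think of is to replace rare words with a placeholder instead of deleting them, so the surrounding word order stays intact. A rough sketch (the <unk> token and the lines variable are my own, not from the tutorial):

from collections import Counter

# count word frequencies over the raw lines (hypothetical `lines` list of strings)
counts = Counter(word for line in lines for word in line.split())
keep = {word for word, _ in counts.most_common(10000)}  # top 10K words

# substitute '<unk>' for every rare word instead of dropping it,
# so neighbouring words are not spliced into sequences that never occurred
cleaned = [' '.join(word if word in keep else '<unk>' for word in line.split())
           for line in lines]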
Is that a sensible approach, or how else can I solve this?
Thanks