
I was following this blog, in which the author shows how to build a simple language model in Keras.

After separating, we need to one-hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 to indicate the specific word at the index of the word's integer value.

This is so that the model learns to predict the probability distribution for the next word, and the ground truth it learns from is 0 for all words except the actual word that comes next.

Keras provides to_categorical(), which can be used to one-hot encode the output words for each input-output sequence pair.

He uses the following:

y = to_categorical(y, num_classes=vocab_size)
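
For illustration, here is a minimal sketch of what that call produces on a toy vocabulary of 5 words (my own made-up numbers, not from the blog):

```python
# Toy example: one-hot encode integer word indices with a 5-word vocabulary.
import numpy as np
from keras.utils import to_categorical

vocab_size = 5
y = np.array([2, 0, 3])  # integer index of the next word for three samples

y_onehot = to_categorical(y, num_classes=vocab_size)
print(y_onehot)
# [[0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]]
print(y_onehot.shape)  # (3, 5): one row per sample, one column per vocabulary word
```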

In his case, the vocabulary size is manageable. I am working with a vocabulary of size > 100 million, so I suspect I should not one-hot encode the output y as he does. Is there any alternative?
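
To make the scale concrete, a rough back-of-the-envelope (my own assumption of float32 targets, not a figure from the blog):

```python
# Rough memory estimate for a single one-hot target row at this vocabulary size.
vocab_size = 100_000_000   # > 100 million words
bytes_per_value = 4        # assuming float32

bytes_per_sample = vocab_size * bytes_per_value
print(bytes_per_sample / 1e9, "GB per training example")  # 0.4 GB for one one-hot row
```

So materialising y this way for even a small batch looks infeasible, which is why I am asking about alternatives.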

Not sure it is implemented in Keras, but you might be interested in Hierarchical Softmax (some info in this [blog post](http://ruder.io/word-embeddings-softmax/) by Sebastian Ruder). – mcoav May 31 '18 at 15:02
Just out of curiosity, how can you possibly have a 100M vocabulary? And seconding @mcoav, you should really think about defining the objective function for such a monstrosity. – dedObed Jun 01 '18 at 21:17

0 Answers