
For a language model, I have to predict the next word for a given sequence of words. My vocabulary contains 1 million words, and I'm trying to predict words from it. I tried to one-hot encode the predicted words using Keras (to_categorical), but for such a large vocabulary I get a memory error in Python. Is there any way to overcome this, or is my approach wrong?
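A minimal sketch of the arithmetic behind that memory error, assuming float32 one-hot rows (the tf.keras to_categorical default); N_TARGETS is a made-up count of target words, chosen only for illustration:

```python
VOCAB_SIZE = 1_000_000   # vocabulary size from the question
N_TARGETS = 100_000      # hypothetical number of target words in the training set

# to_categorical produces a dense float32 matrix of shape (N_TARGETS, VOCAB_SIZE).
bytes_per_row = VOCAB_SIZE * 4                # 4 bytes per float32 value
total_bytes = N_TARGETS * bytes_per_row

print(f"per target word: {bytes_per_row / 2**20:.1f} MiB")    # ~3.8 MiB
print(f"whole target matrix: {total_bytes / 2**30:.0f} GiB")  # ~373 GiB
```

Even a 100 GB server cannot hold a dense matrix like that. One standard way around it in tf.keras is to compile the model with loss='sparse_categorical_crossentropy', which accepts the integer word indices as targets directly, so the one-hot matrix is never materialized.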

– xoxis
  • Why do you have 1M words? Are you one-hot encoding for multiple languages? English, for example, only has 200k or so words per https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language/ – vencaslac Nov 13 '18 at 20:46
  • I am working on a morphologically rich language. I picked the vocabulary from a word2vec model of that language. Should I pick only the most frequent words (see the truncation sketch after these comments)? – xoxis Nov 13 '18 at 20:53
  • Try to find a list of stop words for your language, and unless you _need_ to include specialty terms that are not in common usage, try to scale the vocabulary back a little; the alternative is to buy more RAM. – vencaslac Nov 13 '18 at 20:55
  • I don't think I should remove stop words while developing a language model. I tried it on a server with more than 100 GB of RAM, and I'm not sure why it still shows a memory error. Btw, thank you. – xoxis Nov 13 '18 at 20:59
  • I'm not sure, but I assume you use softmax as the output layer. I suggest a sigmoid output and binary encoding of the word index. For example, to predict 8 words, 3 neurons are enough: word0 = 0 0 0, word1 = 0 0 1, word2 = 0 1 0, ... After the model runs, you check which neurons cross the threshold and treat them as 1 (a sketch of this encoding follows below). – viceriel Nov 14 '18 at 07:17
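A minimal sketch of the vocabulary truncation idea raised in the comments, keeping only the most frequent words and mapping everything else to an out-of-vocabulary token; TOP_K, UNK, build_vocab, and encode are hypothetical names chosen for illustration:

```python
from collections import Counter

TOP_K = 50_000    # hypothetical truncated vocabulary size
UNK = "<unk>"     # out-of-vocabulary placeholder

def build_vocab(tokenized_sentences, top_k=TOP_K):
    """Keep only the top_k most frequent words; everything else maps to UNK."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    words = [UNK] + [w for w, _ in counts.most_common(top_k - 1)]
    return {w: i for i, w in enumerate(words)}

def encode(sentence, vocab):
    """Map words to integer ids, falling back to the UNK id."""
    return [vocab.get(w, vocab[UNK]) for w in sentence]

# Usage with toy data:
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus, top_k=4)
print(encode(["the", "bird", "sat"], vocab))  # "bird" falls back to <unk>
```

With integer ids like these, the targets can stay as a 1-D integer array; combined with a sparse loss, there is no need for to_categorical at all.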
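And a minimal sketch of viceriel's binary-encoding suggestion, purely to illustrate the idea; whether a sigmoid/binary output trains as well as a softmax over words is a separate question:

```python
import numpy as np

def word_id_to_bits(word_id, n_bits):
    """Encode an integer word id as a fixed-width binary vector, MSB first."""
    return np.array([(word_id >> i) & 1 for i in reversed(range(n_bits))],
                    dtype=np.float32)

def bits_to_word_id(probabilities, threshold=0.5):
    """Threshold sigmoid outputs back to bits and rebuild the integer id."""
    bits = (np.asarray(probabilities) >= threshold).astype(int)
    return int("".join(map(str, bits)), 2)

# 1M words fit in 20 bits (2**20 = 1,048,576), so the output layer would need
# only 20 sigmoid units instead of 1,000,000 softmax units.
print(word_id_to_bits(2, 3))             # [0. 1. 0.], as in the comment
print(bits_to_word_id([0.1, 0.9, 0.2]))  # -> 2
```

The model would then end in a Dense layer of n_bits sigmoid units trained with binary cross-entropy, rather than a vocabulary-sized softmax.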

0 Answers