
If I do not pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset?

Why this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I still need to pass this vocabulary size as an argument in the model's first layer definition.


1 Answer


All the words and their indices are stored in a dictionary, which you can access via tokenizer.word_index. Therefore, you can find the number of unique words from the number of elements in this dictionary:

num_words = len(tokenizer.word_index) + 1

The + 1 accounts for the reserved padding index (i.e. index zero).

Note: This solution is applicable only when you have not set the num_words argument (i.e. you don't know or don't want to limit the number of words), since word_index contains all the words (not only the most frequent ones) whether you set num_words or not.
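A minimal sketch of the above (the example sentences are purely illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

# No num_words argument: the tokenizer keeps the full vocabulary.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps each unique word to an integer index starting at 1,
# so the vocabulary size is its length plus one (index 0 is reserved).
vocab_size = len(tokenizer.word_index) + 1

# vocab_size can then be passed to the model's first layer, e.g.
# Embedding(input_dim=vocab_size, output_dim=...).
```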

  • Doesn't seem right, because when I initialize the tokenizer as `Tokenizer(num_words=50000)` and execute `len(tokenizer.word_index) + 1`, I see a number like 75000, way more than the limit I had defined. How is this possible? – karthiks Nov 28 '18 at 18:54
  • @karthiks You mentioned you don't want to set `num_words`. The `word_index` contains **all the words** whether you set `num_words` or not. Therefore, this solution works when you have not limited the number of words (i.e. have not set the `num_words` argument). Otherwise, if you have set `num_words`, then you already know the number of words and don't need this solution in the first place! :) I added a note to my answer to clarify this. – today Nov 28 '18 at 19:19
  • I was pointing out that the assumption vocabulary_size = `len(tokenizer.word_index)+1` fails validation. – karthiks Nov 28 '18 at 20:13
  • I think the +1 is for the "Out of Vocabulary" word. – hAlE Feb 10 '20 at 00:58
  • @hAlE But if you print out word_index, there is an OOV token in it. So what does "reserving padding" mean? – En Xie May 07 '22 at 11:09
  • Could you explain a bit why the + 1 is for reserving padding? – En Xie May 07 '22 at 11:35
  • @EnXie Usually (though not always) you will use a padding token to pad inputs so they all have the same length. By default it's mapped to index zero, and it's not included in `word_index`. If you don't want to count it, then don't use `+1`. – today May 07 '22 at 12:34
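The reserved index 0 that the last comment describes can be seen directly with pad_sequences (the sentences below are illustrative): short sequences are filled with 0, and word_index itself never assigns index 0 to a real word.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["a longer example sentence", "short one"])

# Pad to a fixed length: missing positions are filled
# with the reserved index 0 (pre-padding by default).
padded = pad_sequences(tokenizer.texts_to_sequences(["short one"]), maxlen=4)

# Real words are indexed starting at 1; index 0 never
# appears among the values of word_index.
```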