
If I do not pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset?

Why this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I still need to pass this vocabulary size as an argument in the model's first layer definition.


1 Answer


All the words and their indices are stored in a dictionary, which you can access via tokenizer.word_index. Therefore, you can find the number of unique words from the number of elements in this dictionary:

num_words = len(tokenizer.word_index) + 1

The + 1 accounts for the reserved padding index (i.e. index zero).

Note: This solution is applicable only when you have not set the num_words argument (i.e. you don't know or don't want to limit the number of words), since word_index contains all the words (not only the most frequent ones) whether you set num_words or not.
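A minimal sketch of the above (the example sentences are purely illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

# No num_words argument: the tokenizer keeps the full vocabulary.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps each unique word to an integer index starting at 1,
# so the vocabulary size is its length plus one (index 0 is reserved).
vocab_size = len(tokenizer.word_index) + 1

# vocab_size can then be passed to the model's first layer, e.g.
# Embedding(input_dim=vocab_size, output_dim=...).
```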

  • Doesn't seem right, because when I initialize the tokenizer as `Tokenizer(num_words=50000)` and execute `len(tokenizer.word_index) + 1`, I see a number like 75000, way more than the limit I had defined. How is this possible? – karthiks Nov 28 '18 at 18:54
  • @karthiks You mentioned you don't want to set `num_words`. The `word_index` contains **all the words** whether you set `num_words` or not. Therefore, this solution works when you have not limited the number of words (i.e. have not set the `num_words` argument). Otherwise, if you have set `num_words`, then you already know the number of words and don't need this solution in the first place! :) I added a note to my answer to clarify this. – today Nov 28 '18 at 19:19
  • I was pointing out that the assumption vocabulary_size = `len(tokenizer.word_index)+1` fails validation. – karthiks Nov 28 '18 at 20:13
  • I think the +1 is for the "Out of Vocabulary" word. – hAlE Feb 10 '20 at 00:58
  • @hAlE But if you print out word_index, there is an OOV token in it. So what does "reserving padding" mean? – En Xie May 07 '22 at 11:09
  • Could you explain a bit why the + 1 is for reserving padding? – En Xie May 07 '22 at 11:35
  • @EnXie Usually (though not always) you will use a padding token to pad inputs so they all have the same length. By default it's mapped to index zero, and it's not included in `word_index`. If you don't want to count it, then don't use `+1`. – today May 07 '22 at 12:34
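The reserved index 0 that the last comment describes can be seen directly with pad_sequences (the sentences below are illustrative): short sequences are filled with 0, and word_index itself never assigns index 0 to a real word.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["a longer example sentence", "short one"])

# Pad to a fixed length: missing positions are filled
# with the reserved index 0 (pre-padding by default).
padded = pad_sequences(tokenizer.texts_to_sequences(["short one"]), maxlen=4)

# Real words are indexed starting at 1; index 0 never
# appears among the values of word_index.
```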