In the line of code tokenizer = Tokenizer(num_words=..., oov_token='<OOV>'), what does the num_words parameter actually do, and what should be taken into consideration before deciding what value to assign to it? What is the effect of assigning a very high value versus a very low one?

This might be helpful because it also contains an example: https://stackoverflow.com/questions/46202519/keras-tokenizer-num-words-doesnt-seem-to-work – Ena Dec 31 '20 at 16:40
1 Answer
It is basically the size of the vocabulary you want your model to use, based on the data you have. The simple examples below explain it in detail.
Without num_words:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 4, 2, 1, 6, 7, 2, 8]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Here tokenizer.fit_on_texts(fit_text) builds the word_index from the words present in fit_text, starting with the oov_token (which gets index 1) followed by the remaining words in descending order of frequency from word_counts (that is why 'the', which appears twice, gets index 2).
If you don't set num_words, then all the unique words of fit_text are included in word_index and used to represent the sequences.
If num_words is set, it restricts the sequences: only the num_words - 1 most frequent words from word_index are used when tokenizer.texts_to_sequences() forms the sequences, and any word ranked at num_words or beyond is replaced by the oov_token. Below is an example.
With num_words:
tokenizer = Tokenizer(num_words=4,oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 1, 2, 1, 1, 1, 2, 1]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Regarding model accuracy, it is always better for the sequences to contain the correct representation of the words in your data rather than the oov_token: a very low num_words maps too many real words to the oov_token and loses information, while a very high value keeps rare words that add little signal. With large data it is therefore better to set the num_words parameter to a sensible value rather than burdening the model with the full vocabulary.
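One rough way to pick num_words is to check how much of your corpus the most frequent words cover. Below is a minimal sketch (the corpus list here is a stand-in for your own data) that reads tokenizer.word_counts and prints the coverage of the top N words:
from tensorflow.keras.preprocessing.text import Tokenizer
corpus = ["Example with the first sentence of the tokenizer",
          "Example with the test sentence of the tokenizer"]  # replace with your data
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)
# Sort word frequencies in descending order and measure coverage.
counts = sorted(tokenizer.word_counts.values(), reverse=True)
total = sum(counts)
for n in (5, 10, 20):
    coverage = sum(counts[:n]) / total
    print("top", n, "words cover {:.0%} of all tokens".format(coverage))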
It's also good practice to do preprocessing such as stopword removal and lemmatization/stemming to remove unnecessary words first, and then fit the Tokenizer on the preprocessed data; that makes it easier to choose the num_words parameter well.
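For example, here is a minimal sketch of that pipeline using NLTK (this assumes NLTK is installed and its stopwords corpus has been downloaded via nltk.download('stopwords'); the preprocess helper is purely illustrative):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
def preprocess(text):
    # Drop stopwords and stem whatever remains.
    return ' '.join(stemmer.stem(w) for w in text.lower().split()
                    if w not in stop_words)
corpus = ["Example with the first sentence of the tokenizer"]
cleaned = [preprocess(t) for t in corpus]
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')  # pick num_words after inspecting word_counts
tokenizer.fit_on_texts(cleaned)
print(tokenizer.word_counts)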