
In the line of code tokenizer = Tokenizer(num_words=, oov_token='<OOV>'), what does the num_words parameter actually do, and what should be taken into consideration before deciding the value to assign to it? What is the effect of assigning a very high value to it, and a very low one?

DrDoggo
  • This might be helpful because it also contains an example: https://stackoverflow.com/questions/46202519/keras-tokenizer-num-words-doesnt-seem-to-work – Ena Dec 31 '20 at 16:40

1 Answer


It is basically the size of the vocabulary you want your model to keep, chosen based on the data you have: when producing sequences, the Tokenizer only uses the num_words - 1 lowest word indices. The simple examples below explain it in detail.

Without num_words:

from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the tokenizer with no vocabulary limit
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)

# "test" was never seen during fitting, so it will map to '<OOV>'
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)

print("sequences : ", sequences, '\n')
print("word_index : ", tokenizer.word_index)
print("word counts : ", tokenizer.word_counts)

sequences :  [[3, 4, 2, 1, 6, 7, 2, 8]] 

word_index :  {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts :  OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)]) 

Here tokenizer.fit_on_texts(fit_text) builds word_index from the words present in fit_text, starting with the oov_token at index 1 and followed by the remaining words ordered from most to least frequent according to word_counts.
If you don't set num_words, all the unique words of fit_text are kept in word_index and used to represent the sequences.

If num_words is set, then tokenizer.texts_to_sequences() only keeps words whose index is strictly below num_words (that is, the num_words - 1 lowest indices, including the oov_token); every other word is replaced by the oov_token. Note that word_index itself is not truncated.
Below is an example of this.

With num_words:

# num_words=4 keeps only indices 1-3: '<OOV>' (1), 'the' (2) and 'example' (3)
tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)

test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)

print("sequences : ", sequences, '\n')
print("word_index : ", tokenizer.word_index)
print("word counts : ", tokenizer.word_counts)

sequences :  [[3, 1, 2, 1, 1, 1, 2, 1]] 

word_index :  {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts :  OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)]) 
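This also shows what a very high or very low value does. A minimal sketch, reusing fit_text and test_text from above: a num_words larger than the vocabulary behaves exactly like leaving it unset, while a very low one collapses almost every word into the oov_token.

tokenizer_hi = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer_hi.fit_on_texts(fit_text)
# No index reaches the limit, so the result matches the run without num_words
print(tokenizer_hi.texts_to_sequences(test_text))   # [[3, 4, 2, 1, 6, 7, 2, 8]]

tokenizer_lo = Tokenizer(num_words=2, oov_token='<OOV>')
tokenizer_lo.fit_on_texts(fit_text)
# Only index 1 ('<OOV>') survives, so every word maps to 1
print(tokenizer_lo.texts_to_sequences(test_text))   # [[1, 1, 1, 1, 1, 1, 1, 1]]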

Regarding the accuracy of the model, it's always better to have a correct representation of the words from your data in the sequences instead of the oov_token, so avoid a num_words so low that informative words get lost.
In the case of large data it's better to provide the num_words parameter instead of loading the model with a huge tail of rare words.
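One way to pick the value is from tokenizer.word_counts. A minimal sketch, where min_count and corpus are illustrative names, not part of the code above:

# Fit once with no limit, just to collect word frequencies
probe = Tokenizer(oov_token='<OOV>')
probe.fit_on_texts(corpus)

min_count = 5  # hypothetical threshold: keep words seen at least 5 times
kept = sum(1 for c in probe.word_counts.values() if c >= min_count)

# Indices strictly below num_words survive; index 0 is reserved and
# '<OOV>' sits at index 1, so add 2 to cover the `kept` frequent words
tokenizer = Tokenizer(num_words=kept + 2, oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)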
It's good practice to do preprocessing like stopword removal and lemmatization/stemming to remove all the unnecessary words first, and then fit the Tokenizer on the preprocessed data; that makes it easier to choose the num_words parameter.
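For example, a minimal sketch of such a pipeline, assuming NLTK for the stopword list and lemmatizer (any preprocessing library would do):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.text import Tokenizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # Lowercase, drop stopwords, lemmatize what remains
    words = sentence.lower().split()
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

corpus = ["Example with the first sentence of the tokenizer"]  # your data here
cleaned = [preprocess(s) for s in corpus]

tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
tokenizer.fit_on_texts(cleaned)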