I wanted to understand the difference between these two snippets:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 1)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Output: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

vs

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Output: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

If the tokenizer dynamically indexes all the unique words anyway, what is the use of num_words?

1 Answer

word_index is simply a mapping of words to ids for the entire text corpus that was passed, whatever num_words is.

The difference is evident in usage. For example, if we call texts_to_sequences:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=1+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences)  # [[1], [1], [1]]

Only the id of 'love' is returned because it is the most frequent word: texts_to_sequences keeps only the words whose index is strictly below num_words, so num_words = 1+1 keeps exactly the top 1 word.
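
By default, words that fall outside the num_words cutoff are silently dropped from the sequences. If you would rather map them to a placeholder, the Tokenizer's oov_token parameter does that; a minimal sketch (the '<OOV>' string is just a conventional choice, not required):

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# the oov token takes index 1, so num_words = 2+1 keeps it plus the top 1 word
tokenizer = Tokenizer(num_words=2+1, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}
print(tokenizer.texts_to_sequences(sentences))
# [[1, 2, 1, 1], [1, 2, 1, 1], [1, 2, 1, 1]]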

Instead:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences)  # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]

the ids of the 100 most frequent words are returned; since this corpus only has 6 unique words, every word keeps its id.
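
num_words also fixes the width of the vectorized representations: texts_to_matrix returns one column per possible word id, so the shape depends on num_words rather than on the corpus size. A minimal sketch, reusing the same setup:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100+1)
tokenizer.fit_on_texts(sentences)

# one row per sentence, one column per possible word id (column 0 is unused)
matrix = tokenizer.texts_to_matrix(sentences, mode='binary')
print(matrix.shape)  # (3, 101)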
