>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?

Marcin Możejko
max_max_mir

4 Answers


There is nothing wrong with what you are doing. word_index is computed the same way no matter how many most-frequent words you will use later (as you may see here). So when you call any transformative method, the Tokenizer will use only the three most common words, and at the same time it will keep the counter of all words, even when it's obvious that it will not use it later.
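
For instance, continuing the question's session (a minimal sketch, relying on the fact that index 0 is reserved and only word indices strictly below num_words are kept), texts_to_sequences drops everything except 'world' (1) and 'this' (2):

>>> t.texts_to_sequences(l)
[[1, 2], [1, 2]]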

Marcin Możejko

Just an add-on to Marcin's answer ("it will keep the counter of all words, even when it's obvious that it will not use it later").

The reason it keeps counters for all words is that you can call fit_on_texts multiple times. Each call updates the internal counters, and when transformations are called, they use the top words based on the updated counters.
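
A minimal sketch of that behavior, with two fit_on_texts calls on toy strings (the exact dict ordering may vary by Keras version):

>>> t = Tokenizer(num_words=3)
>>> t.fit_on_texts(["hello world"])
>>> t.fit_on_texts(["world world this this this"])
>>> t.word_counts  # counts accumulate across both calls
OrderedDict([('hello', 1), ('world', 3), ('this', 3)])
>>> t.word_index   # recomputed from the updated counters
{'world': 1, 'this': 2, 'hello': 3}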

Hope it helps.

Gaddy

Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts, and word_docs. It does have an effect on texts_to_matrix: the resulting matrix will have num_words (3) columns. Column 0 corresponds to the reserved index 0 and is therefore always zero, so only 'world' (1) and 'this' (2) are counted.

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}

>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],
       [0., 1., 1.]])
Farid Khafizov

Just to add a little to Farid Khafizov's answer: words whose index is num_words or above are removed from the results of texts_to_sequences ('dog' (4) disappeared from the 1st sentence, 'cat' (5) from the 2nd, and 'you' (6) and 'dog' (4) from the 3rd).

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

print(tf.__version__) # 2.4.1, in my case
sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
seq = tokenizer.texts_to_sequences(sentences)
print(word_index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
print(seq)         # [[3, 1, 2], [3, 1, 2], [1, 2]]
Rich KS