>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?

Marcin Możejko
max_max_mir

4 Answers


There is nothing wrong with what you are doing. word_index is computed the same way no matter how many most-frequent words you will use later (as you may see here). So when you call any transformative method, the Tokenizer will use only the three most common words, and at the same time it will keep the counter of all words, even when it's obvious that it will not use it later.
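
For instance, continuing the question's session (a minimal sketch, relying on the fact that index 0 is reserved and only word indices strictly below num_words are kept), texts_to_sequences drops everything except 'world' (1) and 'this' (2):

>>> t.texts_to_sequences(l)
[[1, 2], [1, 2]]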

Marcin Możejko

Just an add-on to Marcin's answer ("it will keep the counter of all words, even when it's obvious that it will not use it later").

The reason it keeps counters for all words is that you can call fit_on_texts multiple times. Each call updates the internal counters, and when transformations are called, they use the top words based on the updated counters.
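
A minimal sketch of that behavior, with two fit_on_texts calls on toy strings (the exact dict ordering may vary by Keras version):

>>> t = Tokenizer(num_words=3)
>>> t.fit_on_texts(["hello world"])
>>> t.fit_on_texts(["world world this this this"])
>>> t.word_counts  # counts accumulate across both calls
OrderedDict([('hello', 1), ('world', 3), ('this', 3)])
>>> t.word_index   # recomputed from the updated counters
{'world': 1, 'this': 2, 'hello': 3}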

Hope it helps.

Gaddy

Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts, and word_docs. It does have an effect on texts_to_matrix: the resulting matrix will have num_words (3) columns. Column 0 corresponds to the reserved index 0 and is therefore always zero, so only 'world' (1) and 'this' (2) are counted.

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}

>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],
       [0., 1., 1.]])
Farid Khafizov

Just to add a little to Farid Khafizov's answer: words whose index is num_words or above are removed from the results of texts_to_sequences ('dog' (4) disappeared from the 1st sentence, 'cat' (5) from the 2nd, and 'you' (6) and 'dog' (4) from the 3rd).

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

print(tf.__version__) # 2.4.1, in my case
sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
seq = tokenizer.texts_to_sequences(sentences)
print(word_index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
print(seq)         # [[3, 1, 2], [3, 1, 2], [1, 2]]
Rich KS