I have an issue with the Keras IMDB dataset.

The code I am starting from is in the accepted answer here:

Restore original text from Keras’s imdb dataset

import keras
NUM_WORDS=1000 # only use top 1000 words
INDEX_FROM=3   # word index offset

train,test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x,train_y = train
test_x,test_y = test

word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in train_x[0] ))

However, I agree with the comment by Nate Raw saying:

This code is actually incorrect. One line should be changed to word_to_id={k:(v+INDEX_FROM-1) for k,v in word_to_id.items()}. The indexes in the downloaded word_to_id dictionary are actually starting at 1. So, when you add INDEX_FROM to the indexes, it causes there to be a gap between id_to_word[2] and id_to_word[4]. There is no value for id_to_word[3]

If I follow this comment and use INDEX_FROM - 1, the reconstructed review text does not make any sense.

What about id_to_word[3]?

Has anyone managed to solve this issue?

Antonio Sesto

1 Answer

As you can see in the Keras imdb source code and this answer, load_data adds index_from to the original index of each word. The original indices in the word index start at 1, so no word has original index 0, and nothing is ever mapped to 0 + index_from = 3. Therefore id_to_word[3] should be "<UNUSED>", the first valid word index is 4, and the code must use v + INDEX_FROM.
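The offset can be checked with a minimal sketch that uses a toy dictionary in place of the real get_word_index(), so it runs without downloading the dataset. The names toy_word_index, encode and build_id_to_word are hypothetical, and encode only imitates the shift described above (prepend the start marker, add index_from to every rank); it is not the Keras implementation.

```python
INDEX_FROM = 3  # same offset as in the question

# Hypothetical stand-in for keras.datasets.imdb.get_word_index().
# As in the real word index, ranks start at 1, not 0.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}


def encode(words, index_from=INDEX_FROM, start_char=1):
    """Imitate the shift applied by load_data: start marker, then rank + index_from."""
    return [start_char] + [toy_word_index[w] + index_from for w in words]


def build_id_to_word(word_index, index_from=INDEX_FROM):
    """Build the reverse mapping with v + INDEX_FROM and name the reserved ids."""
    word_to_id = {w: rank + index_from for w, rank in word_index.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    word_to_id["<UNUSED>"] = 3  # never produced by encode, filled in for completeness
    return {i: w for w, i in word_to_id.items()}


id_to_word = build_id_to_word(toy_word_index)
encoded = encode(["the", "movie", "was", "great"])
decoded = " ".join(id_to_word[i] for i in encoded)
print(encoded)   # [1, 4, 5, 6, 7]
print(decoded)   # <START> the movie was great

# The INDEX_FROM - 1 variant shifts every word by one position,
# which is why the reconstructed review stops making sense.
shifted = {rank + INDEX_FROM - 1: w for w, rank in toy_word_index.items()}
wrong = " ".join(shifted.get(i, "?") for i in encoded[1:])
print(wrong)     # movie was great ?
```

With v + INDEX_FROM the toy sentence round-trips exactly, while v + INDEX_FROM - 1 decodes each id to its neighbour in the ranking.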

tung2389