I have an issue with the Keras IMDB dataset.
The code I am starting from is in the accepted answer to "Restore original text from Keras’s imdb dataset":
import keras
NUM_WORDS=1000 # only use top 1000 words
INDEX_FROM=3 # word index offset
train,test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x,train_y = train
test_x,test_y = test
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in train_x[0] ))
However, I agree with the comment by Nate Raw saying:
This code is actually incorrect. One line should be changed to word_to_id={k:(v+INDEX_FROM-1) for k,v in word_to_id.items()}. The indexes in the downloaded word_to_id dictionary are actually starting at 1. So, when you add INDEX_FROM to the indexes, it causes there to be a gap between id_to_word[2] and id_to_word[4]. There is no value for id_to_word[3]
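To make the arithmetic in that comment concrete, here is a toy sketch of the two offsets. The three-word index below is made up for illustration and is not the real IMDB index; the helper `build_id_to_word` is also hypothetical. The only thing it assumes from the real data is what the comment states, namely that the ranks in the downloaded dictionary start at 1.

```python
# Toy illustration only: a made-up three-word index, NOT the real IMDB one.
INDEX_FROM = 3

# As the comment states for get_word_index(), ranks here start at 1.
toy_word_to_id = {"the": 1, "and": 2, "a": 3}


def build_id_to_word(word_to_id, offset):
    """Shift every rank by `offset`, add the special tokens, and invert."""
    shifted = {word: rank + offset for word, rank in word_to_id.items()}
    shifted["<PAD>"] = 0
    shifted["<START>"] = 1
    shifted["<UNK>"] = 2
    return {idx: word for word, idx in shifted.items()}


# Offset used in the accepted answer.
with_offset = build_id_to_word(toy_word_to_id, INDEX_FROM)
# Offset suggested in the comment.
with_offset_minus_one = build_id_to_word(toy_word_to_id, INDEX_FROM - 1)

print(sorted(with_offset))            # ids 0, 1, 2, 4, 5, 6: nothing at 3
print(sorted(with_offset_minus_one))  # ids 0, 1, 2, 3, 4, 5: no gap
print(with_offset[4], with_offset_minus_one[4])  # same id, different words
```

In this toy case the two versions disagree on every word id (id 4 decodes to "the" with one offset and to "and" with the other), so switching offsets changes every decoded word, not just the one at id 3.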
If I follow this comment and use INDEX_FROM - 1, the reconstructed review text does not make any sense.
So what about id_to_word[3]? Is it supposed to be missing?
Has anyone managed to solve this issue?