Restore original text from Keras’s imdb dataset - AGAIN

Question

I have an issue with the Keras IMDB dataset.

The code I am starting from is in the accepted answer here:

Restore original text from Keras’s imdb dataset

import keras
NUM_WORDS=1000 # only use top 1000 words
INDEX_FROM=3   # word index offset

train,test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x,train_y = train
test_x,test_y = test

word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[i] for i in train_x[0]))  # use i to avoid shadowing the built-in id

However, I agree with the comment by Nate Raw saying:

This code is actually incorrect. One line should be changed to word_to_id={k:(v+INDEX_FROM-1) for k,v in word_to_id.items()}. The indexes in the downloaded word_to_id dictionary are actually starting at 1. So, when you add INDEX_FROM to the indexes, it causes there to be a gap between id_to_word[2] and id_to_word[4]. There is no value for id_to_word[3]

If I follow this comment and use INDEX_FROM - 1, the reconstructed review text does not make any sense.

What about the id_to_word[3]?

Has anyone managed to solve this issue?

Did you try what Nate said, and change `v+INDEX_FROM` to `v+INDEX_FROM-1`? Regardless, if you leave the code as-is, does it still work for your case? - The Guy with The Hat, Sep 26 '19 at 15:09
Maybe it is just me, but can you please elaborate on the problem a bit more? Is there an expected output that you are unable to achieve? - Ahsun Ali, Sep 26 '19 at 15:53

tung2389 · Answer 1 · 2019-12-14T11:08:22.570


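As you can see in the Keras imdb source code and this answer, load_data adds index_from to the original index of each word. The original indices start at 1, so no word has original index 0. With index_from = 3, that means index 3 (0 + index_from) is never assigned to a word, and the first real word gets index 4. So id_to_word[3] should be "UNUSED", and the code must use v + INDEX_FROM, not v + INDEX_FROM - 1.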
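The offset argument above can be checked without downloading the dataset. The following is only a minimal sketch: raw_word_index is a made-up three-word stand-in for keras.datasets.imdb.get_word_index() (the real one has the same 1-based ranks), build_id_to_word is a hypothetical helper, and the "<UNUSED>" label for index 3 is my own naming, not something Keras defines.

```python
INDEX_FROM = 3  # same offset as in the question

# Hypothetical stand-in for keras.datasets.imdb.get_word_index():
# ranks start at 1 (most frequent word); there is no rank 0.
raw_word_index = {"the": 1, "and": 2, "a": 3}


def build_id_to_word(word_index, index_from=INDEX_FROM):
    """Build the id -> word table using the v + INDEX_FROM shift."""
    word_to_id = {word: rank + index_from for word, rank in word_index.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    word_to_id["<UNUSED>"] = 3  # assumed label; no word ever maps here
    return {idx: word for word, idx in word_to_id.items()}


id_to_word = build_id_to_word(raw_word_index)

# A review encoded the way load_data(index_from=3) encodes it:
# the start marker, then each rank shifted by INDEX_FROM.
encoded = [1, 1 + INDEX_FROM, 3 + INDEX_FROM, 2 + INDEX_FROM]
decoded = " ".join(id_to_word[i] for i in encoded)
print(decoded)  # <START> the a and

# With the v + INDEX_FROM - 1 shift every word lands one slot too low,
# so the same ids decode to the wrong (or to no) words.
shifted = {rank + INDEX_FROM - 1: word for word, rank in raw_word_index.items()}
wrong = " ".join(shifted.get(i, "?") for i in encoded[1:])
print(wrong)  # and ? a
```

With the real dataset the same off-by-one shift would explain why the reconstructed review stops making sense once INDEX_FROM - 1 is used.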

edited Dec 14 '19 at 11:08

answered Dec 14 '19 at 11:02

tung2389
