
I want to train a simple sentiment classifier on the IMDB dataset using pretrained GloVe vectors, an LSTM, and a final dense layer with sigmoid activation.

The problem I have is that the obtained accuracy is relatively low: 78%. This is lower than the 82% accuracy I get when using a trainable embedding layer instead of GloVe vectors.

I think the main reason for this is that only 67.9% of the words in the dataset are found in the GloVe file (I am using the 6B corpus).

I looked at some of the words that were not found in the GloVe file; a couple of examples:

grandmother's twin's

Basically, a lot of words containing an apostrophe are not found in the GloVe file.

I wonder if the data needs to be preprocessed differently. Currently, preprocessing is taken care of by imdb.load_data().

I tried using the larger 42B-token corpus, but that only raised coverage to 76.5%.

I wonder if the data ought to be tokenized differently to get better coverage.
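
For example, one fallback I could imagine (just a sketch against the embeddings_index dict returned by load_embeddings below; the strip rules are only a guess) is to retry the lookup after removing the possessive suffix or other punctuation:

import re

def lookup_with_fallback(word, embeddings_index):
    # try the raw token first
    vec = embeddings_index.get(word)
    if vec is not None:
        return vec
    # "grandmother's" -> "grandmother"
    if word.endswith("'s"):
        vec = embeddings_index.get(word[:-2])
        if vec is not None:
            return vec
    # last resort: drop any remaining non-alphanumeric characters
    return embeddings_index.get(re.sub(r"[^a-z0-9]", "", word.lower()))

Calling this instead of embeddings_index.get(word) when building the embedding matrix would at least tell me how much of the missing coverage is due to apostrophes alone.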

The code is this:

load_embeddings.py

from numpy import asarray
import time

def load_embeddings(filename):
    """Load GloVe vectors from a text file into a dict mapping word -> numpy array."""
    start_time = time.time()
    embeddings_index = dict()
    with open(filename, encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            embedding_vector = asarray(values[1:], dtype='float32')
            embeddings_index[word] = embedding_vector
    end_time = time.time()
    print('Loaded %s word vectors in %f seconds' % (len(embeddings_index), end_time- start_time))
    return embeddings_index

train.py

from __future__ import print_function
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
from load_embeddings import load_embeddings

maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data()
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

word_to_index = imdb.get_word_index()
vocab_size = len(word_to_index)
print('Vocab size : ', vocab_size)


# imdb.get_word_index() maps each word to its frequency rank (1 = most frequent)
words_freq_list = list(imdb.get_word_index().items())

sorted_list = sorted(words_freq_list, key=lambda x: x[1])

print("50 most common words: \n")
print(sorted_list[0:50])


# dimensionality of word embeddings
EMBEDDING_DIM = 100

# Glove file
GLOVE_FILENAME = 'data/glove.6B.100d.txt'

# Words from this index onward are actual words, e.g. 3 -> 'the',
# the most frequent word
INDEX_FROM = 3

word_to_index = {k:(v+INDEX_FROM-1) for k,v in imdb.get_word_index().items()}
word_to_index["<PAD>"] = 0
word_to_index["<START>"] = 1
word_to_index["<UNK>"] = 2

embeddings_index = load_embeddings(GLOVE_FILENAME)
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size+INDEX_FROM, EMBEDDING_DIM))
# rows for <PAD>, <START> and <UNK> stay as zero vectors;
# words not found in the GloVe file also keep a zero vector
embedding_matrix[0] = np.zeros(EMBEDDING_DIM)
embedding_matrix[1] = np.zeros(EMBEDDING_DIM)
embedding_matrix[2] = np.zeros(EMBEDDING_DIM)

for word, i in word_to_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector
  # uncomment below to see which words were not found
  # else :
  #   print(word, ' not found in GLoVe file.')

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
coverage = nonzero_elements / vocab_size
print('Coverage = ',coverage)


# Build and train model

print('Build model...')
model = Sequential()
model.add(Embedding(vocab_size+INDEX_FROM, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False, name= 'embedding'))
model.add(LSTM(EMBEDDING_DIM, dropout=0.2, recurrent_dropout=0.2, name = 'lstm'))
model.add(Dense(1, activation='sigmoid', name='out'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
fornit
  • A fairly standard step in any NLP project is stemming/lemmatizing your words to prevent exactly the scenario you describe (different forms of the same word appearing as different words in the corpus). I would look into different methods for accomplishing that task, try some things, and see if that improves your results – G. Anderson May 20 '19 at 21:35
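
A minimal sketch of that lemmatization idea, assuming NLTK's WordNetLemmatizer (requires nltk.download('wordnet')) and the embeddings_index dict from the question:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_vector(word, embeddings_index):
    # exact match first, then the lemma, e.g. 'movies' -> 'movie'
    vec = embeddings_index.get(word)
    if vec is None:
        vec = embeddings_index.get(lemmatizer.lemmatize(word))
    return vec

Note that this alone won't fix tokens like grandmother's; those would still need the apostrophe stripped first.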

1 Answer


This may help. Your thinking is good: it's not a bad idea to try other pretrained vectors; sometimes they'll be quite a lot better right out of the gate. You can also use Gensim to add entries to GloVe or whichever embedding set you're using.
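
For example, a rough sketch of pulling GloVe into Gensim (assuming the same glove.6B.100d.txt file; glove2word2vec only prepends the header line that the word2vec text format expects):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# convert the GloVe text file into word2vec text format
glove2word2vec('data/glove.6B.100d.txt', 'data/glove.6B.100d.w2v.txt')

# load the vectors; they can then be inspected, queried or extended
glove = KeyedVectors.load_word2vec_format('data/glove.6B.100d.w2v.txt', binary=False)
print(glove['movie'].shape)  # (100,)

From there it's straightforward to build the same kind of embedding_matrix as in the question, or to swap in a different pretrained set.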

TheLoneDeranger