23

I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However, this seems to defeat one of the advantages of using word vector embeddings: when the trained model is used for predictions and it runs into a new word that isn't in the tokenizer's word index, that word is simply removed from the sequence.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

#fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

#load the GloVe embeddings into a dict
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

#create embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index stay all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector[:dims]

#Embedding layer:
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=12)

#then to make a prediction
sequence = tokenizer.texts_to_sequences(["Test sentence"])
model.predict(sequence)

So is there a way I can still use the tokenizer to transform sentences into an array while using as many of the words in the GloVe dictionary as possible, instead of only the ones that show up in my training text?

Edit: Upon further contemplation, I guess one option would be to add a text (or texts) containing a list of the keys in the GloVe dictionary to the texts that the tokenizer is fit on. That might mess with some of the statistics if I want to use tf-idf, though. Is there either a preferable way of doing this or a different, better approach?
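
For illustration, a rough sketch of that idea, assuming embeddings_index has been loaded as above before fitting the tokenizer, and that texts is a plain list of strings:

#fit the tokenizer on the training texts plus one pseudo-text
#containing every word in the GloVe vocabulary
glove_vocab_text = " ".join(embeddings_index.keys())
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts + [glove_vocab_text])
word_index = tokenizer.word_index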

Nick Caussade
  • 231
  • 1
  • 2
  • 6
  • 11
    That is a common problem of word-based tokenization. One approach is to ignore those words, as it's currently happening. A slightly preferable alternative sometimes is to have a token which means "unseen word". Also, there are a few methods on how to compose embeddings of unseen words from those of seen words (check out "out of vocabulary embeddings"). Finally, some people use embedding of character n-grams instead of word embeddings to actually address that problem (especially in scenarios with large and changing vocabularies such as Twitter). – JARS Mar 08 '18 at 13:34
  • related: https://stackoverflow.com/questions/45735070/keras-text-preprocessing-saving-tokenizer-object-to-file-for-scoring/51203923#51203923 – Quetzalcoatl Jul 06 '18 at 06:34
  • hi @JARS, could you provide a link or an example regarding what you said about "Finally, some people use embedding of character n-grams..."? I didn't find anything clearer that could help. – Kleyson Rios Feb 14 '19 at 11:40
  • 1
    @KleysonRios you can use subword models, like [fastText](https://arxiv.org/abs/1607.04606), [BPE](https://arxiv.org/abs/1710.02187), and [ngram2vec](http://www.aclweb.org/anthology/D17-1023) – Separius Mar 12 '19 at 05:55
  • Your problem is handling OOV (Out Of Vocabulary) words. You can use the inbuilt *oov parameter for the keras tokenizer* if you want to keep using the GloVe embeddings -- or you may want to swap GloVe for fastText word embeddings, since fastText handles OOV words inherently and has overall performance similar to GloVe. – Aamir Syed Apr 26 '22 at 01:30

3 Answers

12

The Keras Tokenizer has an oov_token parameter. Just pick a token for it, and unknown words will be mapped to that token.

tokenizer_a = Tokenizer(oov_token=1)
tokenizer_b = Tokenizer()
tokenizer_a.fit_on_texts(["Hello world"])
tokenizer_b.fit_on_texts(["Hello world"])

Outputs

In [26]: tokenizer_a.texts_to_sequences(["Hello cruel world"])
Out[26]: [[2, 1, 3]]

In [27]: tokenizer_b.texts_to_sequences(["Hello cruel world"])
Out[27]: [[1, 2]]
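
If you go this route, note that the OOV token gets its own entry in word_index, so the embedding matrix also needs a row for it. A minimal sketch, assuming embeddings_index and dims are loaded as in the question and tokenizer_a is the fitted tokenizer from above:

embedding_matrix = np.zeros((len(tokenizer_a.word_index) + 1, dims))
for word, i in tokenizer_a.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector[:dims]

# one option: give the OOV row the mean of all GloVe vectors instead of zeros
oov_index = tokenizer_a.word_index[tokenizer_a.oov_token]
embedding_matrix[oov_index] = np.mean(np.stack(list(embeddings_index.values())), axis=0)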
tonicebrian
  • 4,715
  • 5
  • 41
  • 65
7

I would try a different approach. The main problem is that your word_index is based on your training data. Try this:

#load the GloVe embeddings into a dict
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

word_index = {w: i for i, w in enumerate(embeddings_index.keys(), 1)}

#create embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index stay all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector[:dims]

Now your embedding_matrix contains all the GloVe words.

To tokenize your texts you can use something like this:

from keras.preprocessing.text import text_to_word_sequence

def texts_to_sequences(texts, word_index):
    for text in texts:
        tokens = text_to_word_sequence(text)
        yield [word_index.get(w) for w in tokens if w in word_index]

sequence = texts_to_sequences(['Test sentence'], word_index)
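
Note that texts_to_sequences here is a generator, so before calling predict you would materialise and pad it to the model's input length. A small sketch, assuming the input_length of 12 and the model from the question:

from keras.preprocessing.sequence import pad_sequences

sequences = list(texts_to_sequences(['Test sentence'], word_index))
padded = pad_sequences(sequences, maxlen=12)
model.predict(padded)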
spadarian
  • 1,604
  • 10
  • 14
1

I had the same problem. In fact, GloVe covered about 90 percent of my data before it was tokenized.

What I did was create a list of the words from my text column in a pandas DataFrame and then build a dictionary from them with enumerate (just like the Keras tokenizer does, but without altering the words or ordering them by frequency).

Then I checked each word against GloVe and, whenever it was in the GloVe dictionary, added its vector to my initial weights matrix.

I hope the explanation is clear. Here is the code:

# creating a vocab of my data
vocab_of_text = set(" ".join(df_concat.text).lower().split())

# pairing each word with an index
vocab_of_text = list(enumerate(vocab_of_text, 1))

# mapping each word to its index
indexed_vocab = {k: v for v, k in dict(vocab_of_text).items()}

Then we use Glove for our weights matrix:

# creating a matrix for the initial weights
vocab_matrix = np.zeros((len(indexed_vocab) + 1, 100))

# searching for vectors in GloVe
for word, i in indexed_vocab.items():
    vector = embedding_index.get(word)
    # embedding_index is a dictionary of GloVe
    # with the shape 'word': vector

    if vector is not None:
        vocab_matrix[i] = vector

And then, to make the texts ready for the embedding layer:

def text_to_sequence(text, word_index):
    tokens = text.lower().split()
    return [word_index.get(token) for token in tokens if word_index.get(token) is not None]

# giving ids
df_concat['sequences'] = df_concat.text.apply(lambda x : text_to_sequence(x, indexed_vocab))

from keras.preprocessing.sequence import pad_sequences

max_len_seq = 34

# padding
padded = pad_sequences(df_concat['sequences'],
                       maxlen=max_len_seq, padding='post',
                       truncating='post')
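
To wire this into a model, the weights matrix can then seed an Embedding layer, much like in the question (a sketch; trainable=False is just an assumption to keep the GloVe vectors fixed):

from keras.layers import Embedding

embedding_layer = Embedding(vocab_matrix.shape[0],
                            vocab_matrix.shape[1],
                            weights=[vocab_matrix],
                            input_length=max_len_seq,
                            trainable=False)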

Also thanks to @spadarian for his answer; I could only come up with this after reading and implementing part of his idea.

mitra mirshafiee
  • 393
  • 6
  • 17