
Hello Community Members,

At present, I am implementing the Word2Vec algorithm.

First, I extracted the data (sentences), split the sentences into tokens (words), removed the punctuation marks, and stored the tokens in a single list, so the list basically contains only words. Then I calculated the frequency of each word, i.e. counted its occurrences, which results in a frequency list.

Next, I am trying to build and load the model using gensim. However, I am facing a problem: a word is reported as not being in the vocabulary. The code I have tried so far is as follows.

import nltk, re, gensim
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import gutenberg, stopwords

def preprocessing():
    raw_data = (gutenberg.raw('shakespeare-hamlet.txt'))
    tokens = word_tokenize(raw_data)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    global words
    words = [word for word in stripped if word.isalpha()]
    sw = (stopwords.words('english'))
    sw1= (['.', ',', '"', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
    sw2= (['for', 'on', 'ed', 'es', 'ing', 'of', 'd', 'is', 'has', 'have', 'been', 'had', 'was', 'are', 'were', 'a', 'an', 'the', 't', 's', 'than', 'that', 'it', '&', 'and', 'where', 'there', 'he', 'she', 'i', 'and', 'with', 'it', 'to', 'shall', 'why', 'ham'])
    stop=sw+sw1+sw2
    words = [w for w in words if not w in stop]
preprocessing()

def freq_count():
    fd = nltk.FreqDist(words)
    print(fd.most_common())
    freq_count()
def word_embedding():
    for i in range(len(words)):
        model = Word2Vec(words, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = 4)
        model.init_sims(replace = True)
        model.save('word2vec_model')
        model = Word2Vec.load('word2vec_model')
        similarities = model.wv.most_similar('hamlet')
        for word, score in similarities:
            print(word , score)
word_embedding()

Note: I am using Python 3.7 on Windows. From the gensim documentation, it is suggested to use sentences, split them into tokens, and pass those to build and train the model. My question is how to apply the same approach to a corpus that is a single list containing only words. I have also tried passing the words as a list, i.e. [words], when training the model.

M S

2 Answers


The first parameter passed to Word2Vec is expected to be a list of sentences; you're passing a list of words.

import nltk
import re
import gensim
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import gutenberg, stopwords


def preprocessing():
    raw_data = (gutenberg.raw('shakespeare-hamlet.txt'))
    tokens = word_tokenize(raw_data)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    global words
    words = [word for word in stripped if word.isalpha()]
    sw = (stopwords.words('english'))
    sw1 = (['.', ',', '"', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
    sw2 = (['for', 'on', 'ed', 'es', 'ing', 'of', 'd', 'is', 'has', 'have', 'been', 'had', 'was', 'are', 'were', 'a', 'an', 'the', 't',
            's', 'than', 'that', 'it', '&', 'and', 'where', 'there', 'he', 'she', 'i', 'and', 'with', 'it', 'to', 'shall', 'why', 'ham'])
    stop = sw + sw1 + sw2
    words = [w for w in words if not w in stop]


preprocessing()


def freq_count():
    fd = nltk.FreqDist(words)
    print(fd.most_common())
freq_count()


def word_embedding():
    # Train once on the whole corpus; re-training inside a per-word loop is unnecessary.
    # Word2Vec expects a sequence of sentences, so pass the word list wrapped in a list.
    model = Word2Vec([words], size=100, sg=1, window=3,
                     min_count=1, iter=10, workers=4)
    model.init_sims(replace=True)
    model.save('word2vec_model')
    model = Word2Vec.load('word2vec_model')
    similarities = model.wv.most_similar('hamlet')
    for word, score in similarities:
        print(word, score)


word_embedding()
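
If the original "not in the vocabulary" error ever resurfaces, a small defensive check (a sketch using the gensim 3.x `wv.vocab` mapping; gensim 4.x would use `model.wv.key_to_index` instead) avoids querying a word that was never learned:

from gensim.models import Word2Vec

model = Word2Vec.load('word2vec_model')
if 'hamlet' in model.wv.vocab:   # gensim 3.x vocabulary dict
    print(model.wv.most_similar('hamlet'))
else:
    print("'hamlet' is not in the vocabulary")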

hope this helps :)

Madhan Varadhodiyil
  • Thanks Madhan. From the syntax, I learned that the input to the w2v model is a list of sentences (a list of lists), where each sentence is split into word tokens. I am just interested to know whether, instead of a list of lists of split words, the model accepts a `single list` comprising the word tokens of all the sentences. – M S Aug 22 '18 at 10:11

Madhan Varadhodiyil's answer has identified your main problem, passing a list-of-words where Word2Vec expects a sequence-of-sentences (such as a list-of-list-of-words). As a result, each word is seen as a sentence, and then each letter is seen as one word of a sentence – and your resulting model thus probably has just a few dozen single-character 'words'.
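
To make that concrete, here is a toy sketch (a made-up three-word corpus, using the gensim 3.x parameter names from the question) contrasting a flat word list with the expected list of token lists:

from gensim.models import Word2Vec

flat = ['hamlet', 'horatio', 'denmark']          # one list of words (what the question passes)
nested = [['hamlet', 'horatio'], ['denmark']]    # list of token lists (what Word2Vec expects)

# With the flat list, each word is treated as a "sentence" and each
# character as a token, so the vocabulary ends up being single letters.
bad = Word2Vec(flat, size=10, window=3, min_count=1, iter=5)
good = Word2Vec(nested, size=10, window=3, min_count=1, iter=5)

print(sorted(bad.wv.vocab))    # e.g. ['a', 'd', 'e', 'h', ...]
print(sorted(good.wv.vocab))   # ['denmark', 'hamlet', 'horatio']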

If you enabled logging at the INFO level, and watched the output – always good ideas when trying to understand a process or debug a problem – you may have noticed the reported counts of sentences/words as being off.
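
For reference, the usual way to turn that on is the standard Python logging setup shown in gensim's documentation (the exact format string is just a preference):

import logging

# Emit gensim's progress reports (sentence/word counts, training passes) to the console.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)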

Additionally:

  • 'Hamlet' has about 30,000 words – but gensim Word2Vec's optimized code has an implementation limit of 10,000 words per text example (sentence) – so passing the full text in as if it were a single text will cause about 2/3 of it to be silently ignored. Pass it as a series of shorter texts (such as sentences, paragraphs, or even scenes/acts) instead; a sketch of sentence-level training follows this list.

  • 30,000 words is very, very, very small for good word-vectors, which are typically based on millions to billions of words' worth of usage examples. When working with a small corpus, sometimes more training passes than the default epochs=5 can help, sometimes shrinking the dimensionality of the vectors below the default vector_size=100 can help, but you won't be getting the full value of the algorithm, which really depends on large diverse text examples to achieve meaningful arrangements of words.

  • Usually words with just 1 or a few usage examples can't get good vectors from those few (not-necessarily-representative) examples, and further the large number of such words act as noise/interference in the training of other words (that could get good word-vectors). So setting min_count=1 usually results in worse word-vectors, for both rare and frequent words, on task-specific measures of quality, than the default of discarding rare words entirely.
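
Putting these points together, a hedged sketch of sentence-level training on the same play (this is not the asker's exact preprocessing – it leans on NLTK's pre-tokenized gutenberg.sents – and it keeps the gensim 3.x parameter names from the question, where size and iter correspond to vector_size and epochs in gensim 4.x):

from nltk.corpus import gutenberg
from gensim.models import Word2Vec

# gutenberg.sents() already yields one token list per sentence, so each
# text example stays far below the 10,000-word limit.
sentences = [[w.lower() for w in sent if w.isalpha()]
             for sent in gutenberg.sents('shakespeare-hamlet.txt')]

model = Word2Vec(
    sentences,
    sg=1,
    size=50,        # smaller vectors for a ~30k-word corpus
    window=3,
    min_count=5,    # discard very rare words rather than keeping everything
    iter=20,        # extra passes can help a little on a tiny corpus
    workers=4)

# 'hamlet' appears often enough in the play to survive min_count=5
print(model.wv.most_similar('hamlet'))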

gojomo
  • Thanks @gojomo for the suggestion. Could you please suggest the ideal values for the main parameters of the w2v model, i.e. `min_count`, `size`, `workers` and `window` size, for small and large corpora? – M S Aug 22 '18 at 10:17
  • 1
    There's no universal ideal: you have to try it with your data, & whatever your particular end goal is, then adjust based on some hopefully-objective and repeatable way of scoring results. (I'm not sure what your end goal may be in training on a single Shakespeare work.) Usually it's good to start with the defaults, then tinker – but if you do have a rigorous repeatable quantitative way to score your model, you can do a larger automatic exploration of possible parameter combinations. You'll usually get fastest training with `workers` somewhere between 3 and the number of CPU cores. – gojomo Aug 22 '18 at 16:41
  • 1
    The best `window` may vary based on your end-goal, but can perhaps trend smaller with larger corpuses to save time while still getting "good" vectors. The supportable `size` will be smaller with smaller corpuses, and larger with larger corpuses. Often it is necessary to increase `min_count`, discarding more low-frequency words, in larger corpuses just to help keep the model size-on-RAM manageable – but that will often improve the quality of the remaining word-vectors, too. – gojomo Aug 22 '18 at 16:45
  • Thanks a lot for your concern @gojomo. – M S Aug 23 '18 at 07:56