
I wrote the code below to implement word2vec on my data. Now I am testing it by getting the embedding for w2v_model.wv['car_NOUN'], but I get the error below: "word 'car_NOUN' not in vocabulary". I am sure that the word car_NOUN is in the vocabulary, so what is the problem? Can someone help me?

About the code: I used spaCy to restrict the words in the tweets to content words, i.e., nouns, verbs, and adjectives, transformed the words to lower case, and appended the POS tag with an underscore, e.g. love_VERB. Then I wanted to train word2vec on the new list, but I got that error. A sample of the transformed output:

love_VERB old-fashioneds_NOUN

KeyError                                  Traceback (most recent call last)
<ipython-input-145-f6fb9c62175c> in <module>()
----> 1 w2v_model.wv['car_NOUN']

/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453 
    454     def get_vector(self, word):

KeyError: "word 'car_NOUN' not in vocabulary"
Here is my code:

! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')


from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()


import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)  # nrows: read at most this many rows
documents = df.text.values.tolist()
print(documents[:4])


import spacy

nlp = spacy.load('en_core_web_sm')  # other spaCy models work too
# POS tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}

sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_  in included_tags:
            new_sentence.append(token.text.lower()+'_'+token.pos_)
    new_sentences.append(" ".join(new_sentence))

def convert(new_sentences): 
    return ' '.join(new_sentences).split() 

x=convert(new_sentences)


from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION


# initialize model
w2v_model = Word2Vec(size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5, 
                     min_count=100,
                     workers=-1, 
                     hs=0
)

w2v_model.build_vocab(x)

w2v_model.train(x, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)


w2v_model.wv['car_NOUN']
1 Answer

You had a mistake in your convert function: you are supposed to pass a list of lists to Word2Vec, i.e., a list that contains the sentences, each as a list of tokens. I have changed that for you. Basically, you want to go from something like this

['prices_NOUN',
  'change_VERB',
  'want_VERB',
  'research_VERB',
  'price_NOUN',
  'many_ADJ',
  'different_ADJ',
  'sites_NOUN',
  'found_VERB',
  'cheaper_ADJ',]

To something like this

[['prices_NOUN',
  'change_VERB',
  'want_VERB'],
 ['research_VERB',
  'price_NOUN',
  'many_ADJ'],
 ['different_ADJ',
  'sites_NOUN',
  'found_VERB',
  'cheaper_ADJ']]
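
Why this matters: gensim treats each element of the corpus as one sentence and iterates over it, and iterating over a plain Python string yields single characters, so with the flat list the vocabulary ends up containing characters like 'c' and 'a' rather than whole words, which is why 'car_NOUN' is missing. A minimal sketch (plain Python, nothing beyond the standard behavior of strings) of what Word2Vec actually sees:

flat = ['car_NOUN', 'love_VERB']        # what your convert() produced
nested = [['car_NOUN', 'love_VERB']]    # what Word2Vec expects

print(list(flat[0]))    # ['c', 'a', 'r', '_', 'N', 'O', 'U', 'N']
print(list(nested[0]))  # ['car_NOUN', 'love_VERB']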

I have also altered the code around training the model a bit to make it work for me; you might want to experiment with that. In particular, min_count=100 drops every word seen fewer than 100 times, which on only 103 documents throws away essentially the whole vocabulary, so I lowered it.

! pip install wget

from gensim.models.word2vec import FAST_VERSION
from gensim.models import Word2Vec
import spacy
import pandas as pd
from zipfile import ZipFile
import wget

url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

# nrows: read at most this many rows
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.values.tolist()

nlp = spacy.load('en_core_web_sm')  # other spaCy models work too
# POS tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}

sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text.lower()+'_'+token.pos_)
    new_sentences.append(new_sentence)


# initialize model
w2v_model = Word2Vec(new_sentences,
                     size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5,
                     min_count=1,  # <-- it seems your min_count was too high
                     workers=4,  # <-- workers must be a positive int; -1 spawns no worker threads, so no training happens
                     hs=0
                     )

w2v_model.wv['car_NOUN']

Returns

array([ 3.4433445e-03, -4.6847924e-03, -4.6468928e-04, -4.1419661e-04,
        1.6716495e-03, -1.3368594e-03,  2.3602389e-03, -3.5505681e-03,
       -2.6509305e-04,  5.3194270e-04,  2.3251947e-03,  2.1161686e-03,
        3.8566503e-03, -1.0463649e-03, -3.4403126e-04, -2.3808836e-03,
       -1.7489052e-03, -3.6803843e-03, -5.5171514e-04, -4.3218122e-03,
        3.2187223e-03, -1.4893038e-04, -4.7250376e-03, -3.9506676e-03,
        4.9547744e-03,  6.8341813e-04, -1.7588978e-03,  2.9804371e-03,
        1.4809771e-03,  3.8084502e-03,  3.7447066e-05, -2.6706287e-03,
       -8.4727036e-04, -4.8435321e-03, -4.4348584e-03, -3.9350889e-03,
        4.1925525e-03, -2.7435150e-03,  2.5154117e-03, -4.5825918e-03,
       -3.8889556e-03,  4.0331958e-03, -5.7232054e-04,  1.7530264e-03,
        3.8368679e-03, -3.4817799e-03,  2.4366400e-03, -3.7075430e-03,
       -1.2156683e-03,  4.4666473e-03,  1.7927163e-05, -3.2169635e-03,
        1.9718746e-03, -3.0671202e-03, -8.5452310e-04, -2.9490239e-03,
       -4.1346985e-04,  8.5071824e-04,  4.4970238e-03, -2.8501134e-03,
        4.4103153e-03,  1.4589783e-03,  3.6588225e-03, -1.4809598e-03,
       -9.8118311e-05,  2.4781735e-03, -2.4647343e-03,  2.2115968e-03,
        3.1630241e-03, -1.5672935e-04,  1.6695650e-03,  3.5689210e-03,
       -2.6638571e-03,  3.4224256e-03, -1.5750986e-03,  3.6926002e-03,
        3.2584099e-03,  3.8033908e-03,  1.5272110e-04, -2.2282582e-03,
       -4.7118403e-04, -2.5838052e-03, -2.8910220e-03, -3.1307489e-03,
       -4.0518055e-03, -2.3207215e-03,  1.2772443e-03, -4.4162138e-03,
       -1.9835744e-03,  3.0219899e-03,  1.7312685e-03,  3.9408603e-03,
       -5.6407665e-04,  3.2022693e-03, -8.9243404e-04,  4.5719477e-03,
        4.7199172e-03, -4.9393933e-05,  2.2010114e-03, -3.4861618e-03],
      dtype=float32)
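
As an aside, you can guard against this KeyError pattern in general by testing membership before the lookup; a small sketch (gensim's KeyedVectors supports the in operator):

word = 'car_NOUN'
if word in w2v_model.wv:   # __contains__ checks the vocabulary
    vec = w2v_model.wv[word]
else:
    print("word '%s' not in vocabulary" % word)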
  • thank you, but I got the same error with this code also – eli May 28 '20 at 13:34
  • yes, I ran the code that you put above and I got the same error. You changed it to w2v_model = Word2Vec(new_sentences, ...), right? – eli May 28 '20 at 13:39
  • 1
    That's not all - try to copy-paste the whole thing and try it out, it really should work :-) – Bertil Johannes Ipsen May 28 '20 at 13:40
  • yes, it works, thank you very much. And how about the training part? It was like this: ``` w2v_model.build_vocab(new_sentences) w2v_model.train(new_sentences, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs) ``` – eli May 28 '20 at 13:44
  • You don't need that ([check out the documentation](https://radimrehurek.com/gensim/models/word2vec.html)). Passing the sentences to Word2Vec does all that for you (see the sketch after this thread)! If you want a nice explanation, [check this answer out](https://stackoverflow.com/a/48725677/7891326) – Bertil Johannes Ipsen May 28 '20 at 13:46
  • because then I should train 4 more Word2vec models and average the resulting embedding matrices? So there is no need to train separately? – eli May 28 '20 at 13:47
  • This is another question, but again, read the links I provided; and yes, that's where you would use `build_vocab`. – Bertil Johannes Ipsen May 28 '20 at 13:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214827/discussion-between-bertil-johannes-ipsen-and-elham). – Bertil Johannes Ipsen May 28 '20 at 13:49
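
To illustrate the point from the comments above: passing a corpus to the Word2Vec constructor runs build_vocab() and train() for you, so the explicit two-step form is only needed when you want to customize or inspect things in between. A minimal sketch of the equivalence (gensim 3.x keyword names, as used in this post; toy is a hypothetical two-sentence corpus):

from gensim.models import Word2Vec

toy = [['car_NOUN', 'love_VERB'], ['price_NOUN', 'cheap_ADJ']]

# One-shot: the constructor builds the vocab and trains internally
m1 = Word2Vec(toy, size=100, min_count=1, iter=5)

# Explicit two-step: equivalent, but lets you step in between
m2 = Word2Vec(size=100, min_count=1, iter=5)
m2.build_vocab(toy)
m2.train(toy, total_examples=m2.corpus_count, epochs=m2.epochs)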