I wrote the code below to train a word2vec model. When I try to get the embedding with w2v_model.wv['car_NOUN'], I get the error "word 'car_NOUN' not in vocabulary", even though I am sure that the word car_NOUN is in the vocabulary. What is the problem? Can someone help me?
About the code: I used spaCy to restrict the words in the tweets to content words, i.e., nouns, verbs, and adjectives. I transform the words to lower case and append the POS tag with an underscore, e.g. love_VERB. Then I wanted to train word2vec on the new list, but I ran into this error:
love_VERB old-fashioneds_NOUN
KeyError                                  Traceback (most recent call last)
<ipython-input-145-f6fb9c62175c> in <module>()
----> 1 w2v_model.wv['car_NOUN']

2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453
    454     def get_vector(self, word):

KeyError: "word 'car_NOUN' not in vocabulary"
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')
from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()
import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)  # nrows: maximum number of rows to read
documents = df.text.values.tolist()
print(documents[:4])
import spacy
nlp = spacy.load('en_core_web_sm')  # you can use other models
# POS tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}
#document = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text.lower()+'_'+token.pos_)
    new_sentences.append(" ".join(new_sentence))
def convert(new_sentences):
    return ' '.join(new_sentences).split()

x = convert(new_sentences)
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION
# initialize model
w2v_model = Word2Vec(size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5,
                     min_count=100,
                     workers=-1,
                     hs=0
                     )
w2v_model.build_vocab(x)
w2v_model.train(x,
                total_examples=w2v_model.corpus_count,
                epochs=w2v_model.epochs)
w2v_model.wv['car_NOUN']
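
One possible explanation, sketched in plain Python without gensim: Word2Vec expects an iterable of sentences where each sentence is a list of tokens, but convert() returns one flat list of strings. If each string is then treated as a "sentence", iterating over it yields single characters, so the vocabulary would contain letters rather than words like car_NOUN. Separately, min_count=100 discards every token seen fewer than 100 times, which is a lot for 103 documents. The count_vocab helper and the toy data below are hypothetical stand-ins for illustration, not gensim's actual implementation.

```python
from collections import Counter


def count_vocab(sentences, min_count=1):
    """Count tokens by iterating each item of `sentences` as a sequence of tokens.

    Hypothetical helper that mimics how a sentence-iterating trainer
    would build its vocabulary; not part of gensim.
    """
    counts = Counter()
    for sentence in sentences:
        for token in sentence:  # a str here yields single characters
            counts[token] += 1
    # Keep only tokens that occur at least `min_count` times.
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Toy data standing in for the spaCy output in the question (assumed values).
tagged_docs = [
    ["love_VERB", "car_NOUN"],
    ["old_ADJ", "car_NOUN", "drive_VERB"],
]

# Same shape as the result of convert(): one flat list of strings.
flat = " ".join(" ".join(doc) for doc in tagged_docs).split()

vocab_flat = count_vocab(flat)            # each string is iterated as a "sentence"
vocab_nested = count_vocab(tagged_docs)   # each document is a list of tokens
vocab_strict = count_vocab(tagged_docs, min_count=100)

print("car_NOUN" in vocab_flat)    # False: only single characters were counted
print("car_NOUN" in vocab_nested)  # True
print("car_NOUN" in vocab_strict)  # False: fewer than 100 occurrences
```

If that diagnosis applies, the first things to try would be passing a list of token lists (each new_sentence as it is, without joining it into a string) to build_vocab and train, and lowering min_count for such a small sample. It may also be worth replacing workers=-1 with a positive number, since gensim 3.x appears not to start any worker threads for a non-positive value; that last point is an assumption and not verified here.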