
I created word embedding vectors for sentiment analysis, but I'm not sure about the code I wrote. If you see any mistakes in how I create the Word2Vec model or the embedding matrix, please let me know.

import gensim
import numpy as np

EMBEDDING_DIM = 100

# Tokenize each review into a list of words.
review_lines = [sub.split() for sub in reviews]

# Train a skip-gram Word2Vec model on the tokenized reviews.
# (Gensim 3.x parameter names; in Gensim 4.x, `size` became `vector_size`.)
model = gensim.models.Word2Vec(sentences=review_lines, size=EMBEDDING_DIM,
                               window=6, workers=6, min_count=3, sg=1)
print('Words close to the given word:', model.wv.most_similar('film'))
words = list(model.wv.vocab)
print('Words:', words)

# Save the trained vectors in plain-text word2vec format...
file_name = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(file_name, binary=False)

# ...then read them back into a {word: vector} dict.
embeddings_index = {}
f = open(file_name, encoding="utf-8")
next(f)  # skip the header line ("<vocab_size> <vector_size>") that save_word2vec_format writes
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print("Number of word vectors found:", len(embeddings_index))

# Build the embedding matrix: row i holds the vector for the word with index i,
# or stays all zeros if no vector was found for that word.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

OUTPUT:
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.1029947 ,  0.07595579, -0.06583303, ...,  0.10382118,
        -0.56950015, -0.17402627],
       [ 0.13758609,  0.05489254,  0.0969701 , ...,  0.18532865,
        -0.49845088, -0.23407038],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
Seda Yılmaz
  • What is `word_index` (a variable used in your code but never declared), and why would its list of words necessarily match those in the earlier `model`? Why are you writing your own code to read the file, when you could use Gensim's built-in `KeyedVectors.load_word2vec_format(FILENAME)`? – gojomo Jan 25 '21 at 17:55
  • The word_index variable is a dictionary containing the 15 thousand most common words in my data; it is built elsewhere in my code. I tried to create the word embedding myself here, which is why I read the file manually, but I guess that was unnecessary. I'm also not clear on the advantage of using a ready-made pre-trained word embedding, and I have no idea how to use KeyedVectors.load_word2vec_format(FILENAME). – Seda Yılmaz Jan 25 '21 at 19:30

1 Answer


It's likely the zero rows are there because you initialized the embedding_matrix with all zeros, but then your loop didn't replace those zeros for every row.

If any of the words in word_index aren't in the embeddings_index dict you've built (or in the model before that), that would be the expected result.
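
As a quick check (a sketch, using the word_index, embeddings_index, and embedding_matrix names from your code), you can list the words that never got a vector and confirm they account for the all-zero rows:

import numpy as np

# Words in word_index that have no trained vector -> their rows stay all-zero.
missing = [w for w in word_index if w not in embeddings_index]
print('Words without a vector:', len(missing))

# Rows of the matrix that are still entirely zero (row 0 is always zero by construction).
zero_rows = np.where(~embedding_matrix.any(axis=1))[0]
print('All-zero rows in embedding_matrix:', len(zero_rows))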

Note that while the saved word-vector format isn't very complicated, you still don't need to write your own code to parse it back in. The KeyedVectors.load_word2vec_format() method will work for that, giving you an object that allows dict-like access to each vector by its word key. (And the vectors are stored in one dense array, so it's a bit more memory-efficient than a true dict with a separate ndarray as each value.)
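
For example (a minimal sketch, reusing the file name, word_index, and EMBEDDING_DIM from your code):

from gensim.models import KeyedVectors
import numpy as np

# Load the vectors saved earlier with save_word2vec_format().
word_vectors = KeyedVectors.load_word2vec_format('embedding_word2vec.txt', binary=False)

# Build the embedding matrix straight from the KeyedVectors object.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word_vectors:          # only words the model actually learned
        embedding_matrix[i] = word_vectors[word]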

There would still be the issue of your word_index listing words that weren't trained by the model. Perhaps they weren't in your training texts, or didn't appear at least min_count (default: 5) times, as required for the model to take notice of them. (You could consider lowering min_count, but note that it's usually a good idea to discard such very-rare words: they wouldn't have gotten very good vectors from so few examples, and even including such thinly-represented words can worsen surrounding words' vectors.)
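
For instance, you could check how much of your word_index the trained model actually covers before deciding whether to lower min_count (a sketch, using your variable names):

# How many of your word_index words did the Word2Vec model learn vectors for?
covered = [w for w in word_index if w in model.wv]
print('Covered:', len(covered), 'of', len(word_index))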

If you absolutely need vectors for words not in your training data, the FastText variant of the word2vec algorithm can, in languages where similar words often share similar character runs, offer synthesized vectors for unknown words that are somewhat better than random/null vectors for most downstream applications. But you really should prefer to have adequate real examples of each interesting word's usage in varying contexts.
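
A minimal sketch of that alternative, training Gensim's FastText on the same review_lines with the same (Gensim 3.x-era) parameters as your Word2Vec call:

from gensim.models import FastText

ft_model = FastText(sentences=review_lines, size=EMBEDDING_DIM,
                    window=6, workers=6, min_count=3, sg=1)

# Unlike Word2Vec, FastText can synthesize a vector for a word it never saw,
# using that word's character n-grams.
print(ft_model.wv['rewatchable'])   # hypothetical out-of-vocabulary word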

gojomo
  • Thank you very much for your explanation; it was very illuminating. I have one more question. I use IMDB movie reviews as my dataset, and 'GoogleNews-vectors-negative300.bin' as a pre-trained word embedding. These are two quite different datasets. Would it be unreasonable to use 'GoogleNews-vectors-negative300.bin'? For example, a word used in movie reviews may not appear in news texts. What kind of problem does this pose? – Seda Yılmaz Jan 26 '21 at 15:15
  • Yes, reusing the 'GoogleNews' vectors can be a problem, in a number of ways: (1) Google's never fully documented exactly how they preprocessed/phrase-combined those tokens, so preprocessing other texts to match tokens is often an ad-hoc process; (2) word senses in pre-2013 news stories will vary from those used in other domains; (3) words used in other domains, especially newer words, may not appear there at all. If you have enough data, it's often better to train your own word-vectors on your own domain data – at least as one option to test against using older/other-domain word-vectors! – gojomo Jan 26 '21 at 16:37
  • Hi :) I have one more question. Even if I do not perform data cleaning, which is one of the usual first steps, I get a successful result on the training set. Shouldn't my accuracy and loss be bad when I'm not doing data cleaning? In my model, when I'm not doing data cleaning, my accuracy is 91% and my loss is 11%. What is the reason for this? – Seda Yılmaz Jan 27 '21 at 15:09
  • I don't know what you mean by "data cleaning" here, nor what later steps you're doing to calculate those accuracy numbers. Whether any particular data-preprocessing is worth the trouble depends on the specifics of your data, goals, choice of algorithms, etc. (It's not automatically the case that "more fussing around with the data" is better - you have to try things & compare the results of each variant, and sometimes "standard" steps may discard data that in your particular case is helpful.) – gojomo Jan 27 '21 at 16:33
  • While doing data cleaning, I followed these steps: lowercasing, number removal, punctuation removal, stop-word removal, and lemmatizing. I guess I may not need all of these steps, so I will try again and remove the unnecessary ones; if a step is not needed, I will not perform it. Thank you very much for your information. You have been very helpful :) – Seda Yılmaz Jan 27 '21 at 19:19
  • Those are all common steps, but if you can evaluate every permutation, you may find that some help & some don't! For example, in movie titles, capitalization, especially mid-sentence capitalization, might be an important cue to word sense. Lemmatization sometimes helps 'stretch' limited data by grouping related word-forms into one token - but sometimes the word-form distinctions are crucial for classification, and with lots of data, an algorithm like word2vec can learn OK vectors for each individual word-form, which are themselves close enough together to indicate the relatedness. – gojomo Jan 27 '21 at 20:54