I have a Doc2Vec model trained on website data, and I am trying to use its embeddings to find the websites that are most similar to each other. To do so, I compute the cosine similarity over the full matrix of document vectors, and I compare the result to the output of most_similar().
The problem is that the two approaches return substantively different matches, not just slightly different ones.
To make this concrete, for the firm at index 791, whose text is stored in the variable text, I compare:
text = self.website_info.iloc[791].text
tokens = text.split()
# Infer a fresh vector from the document's tokens, then query the trained doc vectors with it
vec = self.word2vec_model.infer_vector(tokens, negative=0)
most_similar = self.word2vec_model.docvecs.most_similar([vec])
to the ranking taken from the full cosine-similarity matrix:
import numpy as np
import pandas as pd
# Pairwise cosine similarity between all unit-normalized document vectors
self.word2vec_model.init_sims()
mat = self.word2vec_model.docvecs.get_normed_vectors()
w2v_sim = np.dot(mat, mat.T)
# Rank every document by its similarity to document 791 (the stored, trained vector)
sims = pd.DataFrame({'sim': w2v_sim[791]})
sims = sims.sort_values(by='sim', ascending=False)
most_similar = sims.head(20)
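To put the two rankings on the same footing, the same cosine ranking can be applied to any query vector, whether it is a stored row of the matrix or an inferred vector. Below is a minimal NumPy-only sketch of that idea. The helper name rank_by_cosine and the toy matrix are made up for illustration and are not part of gensim or of my model; it also assumes no vector has zero norm.

```python
import numpy as np

def rank_by_cosine(doc_vectors, query, topn=20):
    """Return (index, similarity) pairs for the rows most similar to `query`.

    Hypothetical helper for illustration; not part of gensim.
    Assumes no row of `doc_vectors` (and not `query`) has zero norm.
    """
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Unit-normalize the rows and the query so the dot product is the cosine.
    normed = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = normed @ q
    # Sort by descending similarity and keep the first `topn` rows.
    order = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in order]

# Toy data standing in for the real document vectors (made up for illustration).
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
ranking = rank_by_cosine(toy, toy[0], topn=3)
```

Calling this once with the stored row for document 791 and once with the inferred vector would show whether the difference comes from the query vector rather than from the ranking code.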
I also see that the trained embedding vector and the inferred vector for the same document are substantively different: not just in normalization or magnitude, but with large differences in the sign of the components.
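For reference, this is roughly the kind of check I mean when comparing a stored vector with an inferred one. It is only a sketch: compare_vectors is a hypothetical helper, and the two vectors are made-up numbers, not values from my model.

```python
import numpy as np

def compare_vectors(stored, inferred):
    """Return (cosine similarity, fraction of components whose signs match).

    Hypothetical helper for illustration; assumes neither vector has zero norm.
    """
    stored = np.asarray(stored, dtype=float)
    inferred = np.asarray(inferred, dtype=float)
    # Cosine similarity ignores overall scale, so normalization cannot explain a low value.
    cosine = float(stored @ inferred / (np.linalg.norm(stored) * np.linalg.norm(inferred)))
    # Share of dimensions where both vectors have the same sign.
    sign_agreement = float(np.mean(np.sign(stored) == np.sign(inferred)))
    return cosine, sign_agreement

# Made-up vectors for illustration only; not taken from my model.
stored_vec = np.array([0.5, -0.2, 0.1, 0.4])
inferred_vec = np.array([0.4, 0.3, 0.1, -0.5])
cosine, sign_agreement = compare_vectors(stored_vec, inferred_vec)
```

A cosine near 1 with high sign agreement would mean the two vectors differ only in scale, which is not what I observe.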