0

I have a certain Doc2Vec model built on website data. I am trying to use the embeddings to find websites that are most similar to each other. To do so, I am doing a cosine similarity of the matrix. I am also comparing this to the output of most_similar().

The problem, they are providing substantively different matches (not only slightly different).

To make concrete, for a firm of index value 791 and text on value text I compare.

text = self.website_info.iloc[791].text
tokens = text.split()
vec = self.word2vec_model.infer_vector(tokens,negative=0)

most_similar = self.word2vec_model.docvecs.most_similar([vec])

to

self.word2vec_model.init_sims()
mat = self.word2vec_model.docvecs.get_normed_vectors()
w2v_sim = np.dot(mat, mat.T)

sims = pd.DataFrame(pd.Series(w2v_sim[791]))
sims.rename(columns={0:'sim'}, inplace = True)
sims.sort_values(by='sim',ascending=False,inplace=True)

most_similar  = sims.head(20)

I also see that the embedding vectors real and inferred are substantively different. Not just normalization or values, but big differences in the sign of the components.

Jorge Guzman
  • 489
  • 1
  • 4
  • 16

1 Answers1

0

There's a bunch in your code that doesn't quite make sense.

  • If you're using Gensim-4.0 or higher – where .get_normed_vectors() exists – there's never a need to call .init_sims(). (In fact, it should be showing a deprecation warning.)

  • Only a Doc2Vec model will support .infer_vector() – so it's odd to name your variable, word2vec_model

  • The .infer_vector() method doesn't take a negative=0 argument - so that code would generate an error in the Gensim library that you otherwise appear to be using. (And, if it did somehow take a negative argument changing the inference to use 0 negative-examples, that'd break inference - which should use the same negative value as during model training.)

I'm also not sure about your alternate calculation - in particular, dirving it through Pandas instead of native Numpy operations seems unnnecessary, and an all-to-all comparison would often be very expensive, in any model with a sufficiently-large number of documents.

But also: the inferred vector will essentially never be identical to the vector for the same text during training. It'll just be 'close', if everything about the model is working well, with sufficient training data & parameters (especially enough epochs & not too many vector_size dimensions). (See this FAQ item for more details on how there's an inherent 'jitter' between runs/inferences.)

So I'd suggest:

  • 1st, check how similar the inferred-vector for your item #791 is to the vector created by bulk-training. (You could both compare them to each other, or compare the list of top-N .most_similar() items to each.) If they're very different, there may be other problems with the model training (data/parameters) that make the model underpowered. (In some cases, more epochs or fewer vector_size dimensions can help a little to make a model more consistent, run-to-run, but if your data is thin there will be limits to how well Doc2Vec can work.)

  • Check that your alternate calculation of the nearest-neighbors exactly matches that returned by .most_similar(), when using the exact same (not inferred) origin-vector. If it doesn't, that'd be a separate issue than any looseness/variance between the vector from bulk-training and that from later re-inference.

  • Try to evaluate the actual quality of the .most_similar() results - either by ad hoc eyeballing, or some sort of rigorous domain-expert golden-standard of which docs 'should' be judged alike. The calculation by the .most_similar() method is a typical approach, and usually what people want - so knowing if that's helpful for your data/model/goals may be more interesting than whether you can match it with a separate external calculation.

If you're still having problems, be sure in any followup comments, question edits, or new questions to say a bit more about:

Those can help determine if something else more foundational is wrong/weak with your model.

gojomo
  • 52,260
  • 14
  • 86
  • 115