
I have built a gensim Doc2Vec model. Let's call it doc2vec. Now I want to find the words most relevant to a given document, according to my doc2vec model.

For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.

Now I want to find the most relevant words to "doc_about_java".

I followed the solution from the closed question How to find most similar terms/words of a document in doc2vec?, but it gives me seemingly random words; the word "java" is not even among the first 100 similar words:

docvec = doc2vec.docvecs['doc_about_java']
print(doc2vec.most_similar(positive=[docvec], topn=100))

I also tried it like this:

print(doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"]))

but it didn't change anything. How can I find the most similar words to a given document?

aburkov

1 Answer


Not all Doc2Vec modes even train word-vectors. In particular, the PV-DBOW mode (`dm=0`), which often works very well for doc-vector comparisons, leaves word-vectors at their randomly-assigned initial (and unused) positions.

So that may explain why the results of your initial attempt to get a list-of-related-words seem random.

To get word-vectors, you'd need to use PV-DM mode (`dm=1`), or add optional concurrent word-vector training to PV-DBOW (`dm=0, dbow_words=1`).
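For example, a minimal sketch of both setups (assuming `tagged_documents` is your iterable of `TaggedDocument` objects; the other parameter values are only illustrative):

from gensim.models.doc2vec import Doc2Vec

# PV-DM (dm=1): trains word-vectors alongside doc-vectors
model_dm = Doc2Vec(documents=tagged_documents, dm=1, vector_size=100,
                   window=8, min_count=5, workers=4, epochs=20)

# PV-DBOW (dm=0) with dbow_words=1: interleaves skip-gram word-vector
# training with the doc-vector training
model_dbow = Doc2Vec(documents=tagged_documents, dm=0, dbow_words=1,
                     vector_size=100, min_count=5, workers=4, epochs=20)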

(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)

(Separately, your alternate attempt, `doc2vec["doc_about_java"]`, retrieves a word-vector for the token "doc_about_java", which may not be present in the word vocabulary at all. To get the doc-vector, use `doc2vec.docvecs["doc_about_java"]`, as in your first code block.)
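To make the distinction concrete, a sketch of the two lookups (assuming the tag was present during training):

# doc-vector lookup, by document tag
docvec = doc2vec.docvecs['doc_about_java']

# nearest words to that doc-vector
print(doc2vec.wv.similar_by_vector(docvec, topn=100))

# by contrast, doc2vec['doc_about_java'] is a word-vector lookup,
# which raises a KeyError if no such word is in the vocabulary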

gojomo
  • Thank you! This is how I train: `doc2vec = Doc2Vec(documents = tagged_documents, vector_size = dimensionality, window = 8, min_count = 5, workers = 4, epochs = epochs)`. According to https://radimrehurek.com/gensim/models/doc2vec.html, `dm=1` is the default parameter, so I assume that the training is in PV-DM mode. Should I specify `dm=1` explicitly? – aburkov Mar 07 '18 at 02:29
  • Also I guess that word vectors have been trained, because `print(doc2vec.wv.most_similar("java", topn=10))` gives me [(u'java/j2ee', 0.4498763680458069), (u'java.', 0.40088707208633423), (u'java/', 0.37919291853904724), (u'java/java', 0.3360544443130493), (u'jpa/hibernate', 0.3188815116882324), (u'java/jee', 0.31387391686439514), (u'jsps', 0.30111759901046753), (u'andgenuinely', 0.2981807589530945), (u'sti-f-06', 0.2916569113731384), (u'-jpa', 0.2886544466018677)] – aburkov Mar 07 '18 at 02:33
  • Yes, since `dm=1` is default, you don't need to specify it, & you should be getting both doc- and word-vectors. And your check of most-similar for the word `'java'` confirms word-training happens. Are you sure your probe `'doc_about_java'` is a good example of a doc about java? Are the results for `doc2vec.wv.similar_by_vector(doc2vec.docvecs['doc_about_java'])` totally unrelated? Are you using a small/thin dataset or very few training passes? Are you sure as much training as you expect is happening – especially that `tagged_documents` works as a restartable iterable & logs show expected progress (a logging sketch follows these comments)? – gojomo Mar 07 '18 at 02:46
  • The results for words similar to the document are totally unrelated: they look like noise made of some very rare tokens, completely unrelated to java. The document about java is very representative. The dataset is quite big; when I train, the process consumes almost 64 GB of RAM. You gave me some hints. I will increase the number of training passes and will check the logs (if I find them :-) – aburkov Mar 07 '18 at 02:59
  • I'd also double-check that the `most_similar()` doc-vec results are as strong as you think, and try some other representative docs, to see if it's a general problem or just limited to some idiosyncratic docs. – gojomo Mar 07 '18 at 03:15
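Regarding the logging suggestion in the comments above, a minimal sketch of enabling gensim's progress output (gensim reports via Python's standard logging module; the format string is just a common choice):

import logging

# gensim logs training progress at INFO level
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# subsequent Doc2Vec training will now log per-epoch progress, making it
# visible whether tagged_documents is re-iterated on every epoch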