I am working on text embedding in Python, where I compute the similarity between two documents with a Doc2Vec model. The code is as follows:

for doc_id in range(len(train_corpus)):
    # Infer a vector for this document from its words.
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank all trained document vectors by similarity to the inferred vector.
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    # Show the most, second-most, median, and least similar documents.
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Now, given two such embedded documents, how can I extract the set of semantically similar words between them?

Please help me out.

  • Can you be more specific about what you mean by "semantically similar words of those particular documents"? Do you mean words that are somehow similar-to, or representative-of, the whole documents? Or pairs of words, one from document-A, one from document-B, that are similar? Or something else? (What's the ultimate goal you'd like to achieve?) – gojomo Apr 11 '20 at 00:12
  • I mean, for example, if Doc1 = 'sky beautiful weather good' and Doc2 = 'nice weather black rainclouds dance like peacock', and these two documents (Doc1 and Doc2) are about 80% similar. Now I want the list of similar word-pairs of these documents, like [sky-rainclouds, beautiful-nice, weather-weather, good-nice,... ]. – chirayu upadhyay Apr 11 '20 at 11:20

1 Answer

Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors with added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.

Calling model.wv.similarity(word1, word2) will give you the pairwise similarity for any 2 words.

So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
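The iteration described above could be sketched roughly as follows. This is only an illustration under some assumptions: `best_matches` is a hypothetical helper name, and `ToyWordVectors` is a hand-made stand-in for `model.wv` with invented 2-dimensional vectors, so that the snippet runs without a trained model. With a real model you would pass `model.wv` instead, and every word would need to be in its vocabulary.

```python
import math


class ToyWordVectors:
    """Hypothetical stand-in for model.wv; only similarity() is provided."""

    def __init__(self, vectors):
        self.vectors = vectors

    def similarity(self, w1, w2):
        # Cosine similarity between the two words' vectors.
        a, b = self.vectors[w1], self.vectors[w2]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)


def best_matches(wv, doc1_words, doc2_words):
    """For each word in doc1, return (word, closest doc2 word, similarity)."""
    pairs = []
    for w1 in doc1_words:
        # Pick the doc2 word with the highest similarity to w1.
        best = max(doc2_words, key=lambda w2: wv.similarity(w1, w2))
        pairs.append((w1, best, wv.similarity(w1, best)))
    return pairs


# Invented vectors, for illustration only (not output of any real model).
toy_wv = ToyWordVectors({
    "sky": (1.0, 0.1),
    "rainclouds": (0.9, 0.2),
    "weather": (0.5, 0.5),
    "good": (0.1, 1.0),
    "nice": (0.2, 0.9),
})

pairs = best_matches(
    toy_wv,
    ["sky", "weather", "good"],
    ["nice", "weather", "rainclouds"],
)
for w1, w2, sim in pairs:
    print("%s-%s: %.3f" % (w1, w2, sim))
```

Note that this reports one best partner per doc1 word, so the same doc2 word can appear in several pairs; whether that is acceptable depends on your goal.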

gojomo
  • Hi gojomo, while doing what you suggested, I got this error: KeyError: "word 'liberal' not in vocabulary" – chirayu upadhyay Apr 14 '20 at 23:26
  • If the word `'liberal'` is not in your model, that's the error you'll get. If you'll be checking words against your model that might not be present in it, you should test for their presence first. (For example, test if `'liberal' in model.wv`.) – gojomo Apr 14 '20 at 23:31
  • And what if the word 'liberal' is not in the model? – chirayu upadhyay Apr 14 '20 at 23:54
  • If there is not a word-vector for `'liberal'` in the model, then you can't look up a compatible word-vector for it. You should either skip it, or find some other model that can provide all the word-vectors you need. – gojomo Apr 15 '20 at 01:22
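
The presence check and the "skip it" option from the comments above might look something like this. It is a minimal sketch: `split_by_vocabulary` is a hypothetical helper name, and a plain Python set with an invented vocabulary stands in for `model.wv`, since only the `in` test matters here.

```python
def split_by_vocabulary(wv, words):
    """Split words into those present in wv and those missing from it."""
    known = [w for w in words if w in wv]
    missing = [w for w in words if w not in wv]
    return known, missing


# Toy stand-in for model.wv: an invented vocabulary, used only for membership.
toy_vocab = {"sky", "weather", "nice"}

known, missing = split_by_vocabulary(toy_vocab, ["sky", "liberal", "weather"])
print("usable words:", known)
print("skipped (no word-vector):", missing)
```

Only the `known` words would then be passed on to similarity lookups, which avoids the KeyError for out-of-vocabulary words.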