How can I get a list of semantically similar words from two embedded documents in Python?

Question

I am working on text embeddings in Python. I computed the similarity between documents with a Doc2Vec model. The code is as follows:

for doc_id in range(len(train_corpus)):
    # Infer a vector for this document from its list of words.
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every trained document vector by similarity to the inferred vector.
    # (In gensim 4.x, model.docvecs is named model.dv.)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Now, given two of these embedded documents, how can I extract a set of semantically similar words from those particular documents?

Please help me out.

Can you be more specific about what you mean by "semantically similar words of those particular documents"? Do you mean words that are somehow similar-to, or representative-of, the whole documents? Or pairs of words, one from document-A, one from document-B, that are similar? Or something else? (What's the ultimate goal you'd like to achieve?) – gojomo Apr 11 '20 at 00:12
I mean, for example, Doc1 = 'sky beautiful weather good' and Doc2 = 'nice weather black rainclouds dance like peacock', and these two documents (Doc1 and Doc2) are about 80% similar. Now I want the list of similar word pairs from these documents, like [sky-rainclouds, beautiful-nice, weather-weather, good-nice, ...]. – chirayu upadhyay Apr 11 '20 at 11:20

score 0 · Accepted Answer · answered Apr 12 '20 at 03:03

Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors plus added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.

A call to the model.wv.similarity(word1, word2) method will give you the pairwise similarity for any two words.

So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
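
The loop described above could be sketched roughly as follows. This is only an illustration: ToyVectors is a hypothetical stand-in for model.wv (it supports just the `in` test and similarity()) with made-up 2-dimensional vectors, so that the sketch runs without a trained model, and best_matches is an invented helper name.

```python
import math


class ToyVectors:
    """Hypothetical stand-in for model.wv: supports `in` and similarity()."""

    def __init__(self, vectors):
        self.vectors = vectors

    def __contains__(self, word):
        return word in self.vectors

    def similarity(self, w1, w2):
        # Cosine similarity between the two word-vectors.
        a, b = self.vectors[w1], self.vectors[w2]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)


def best_matches(doc1_words, doc2_words, wv):
    """For each word in doc1, return (word, closest doc2 word, similarity)."""
    pairs = []
    for w1 in doc1_words:
        best = max(doc2_words, key=lambda w2: wv.similarity(w1, w2))
        pairs.append((w1, best, wv.similarity(w1, best)))
    return pairs


# Made-up vectors, chosen only so that related words point in similar directions.
toy_wv = ToyVectors({
    "sky": [1.0, 0.1], "rainclouds": [0.9, 0.2],
    "beautiful": [0.1, 1.0], "nice": [0.2, 0.9],
    "weather": [0.7, 0.7],
})
doc1 = ["sky", "beautiful", "weather"]
doc2 = ["nice", "weather", "rainclouds"]

pairs = best_matches(doc1, doc2, toy_wv)
for w1, w2, score in pairs:
    print(f"{w1}-{w2}: {score:.2f}")
```

With a real model you would pass model.wv in place of toy_wv. Every word in both lists must have a word-vector in the model, otherwise similarity() raises a KeyError.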

gojomo

Hi gojomo, while doing what you suggested, I got this error: KeyError: "word 'liberal' not in vocabulary" – chirayu upadhyay Apr 14 '20 at 23:26
If the word `'liberal'` is not in your model, that's the error you'll get. If you'll be looking up words that might not be present in your model, you should check whether they're present first. (For example, test if `'liberal' in model.wv`.) – gojomo Apr 14 '20 at 23:31
And what if the word 'liberal' is not in the model? – chirayu upadhyay Apr 14 '20 at 23:54
If there is not a word-vector for `'liberal'` in the model, then you can't look up a compatible word-vector for it. You should either skip it, or find some other model that can provide all the word-vectors you need. – gojomo Apr 15 '20 at 01:22
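
A minimal sketch of the "check first, then skip" approach from the comments above. The helper name split_by_vocabulary is invented for illustration, and a plain dict stands in for model.wv here only because it supports the same `word in ...` test.

```python
def split_by_vocabulary(words, wv):
    """Separate words that have a word-vector from those that do not."""
    known = [w for w in words if w in wv]
    skipped = [w for w in words if w not in wv]
    return known, skipped


# Hypothetical vocabulary; with a trained model, pass model.wv instead.
toy_vocab = {"weather": [0.7, 0.7], "nice": [0.2, 0.9]}

known, skipped = split_by_vocabulary(["nice", "liberal", "weather"], toy_vocab)
print("usable:", known)
print("skipped (no word-vector):", skipped)
```

Filtering each document's words this way before computing similarities avoids the KeyError, at the cost of silently ignoring words the model never saw.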
