
I have a set of 20 small documents that talk about a particular kind of issue (my training data). Now I want to identify, out of 10K documents, those that are talking about the same issue.

For this purpose I am using the gensim Doc2Vec implementation:

import csv

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
    
# tokenize_and_stem tokenizes and stems the text and returns a list of tokens
# documents_prb stores the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)]) for i, _d in enumerate(documents_prb)]
max_epochs = 20
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
    
def doc2vec_score(s):
    s_list = tokenize_and_stem(s)
    v1 = model.infer_vector(s_list)
    similar_doc = model.docvecs.most_similar([v1])
    original_match = (X[int(similar_doc[0][0])])
    score = similar_doc[0][1]
    match = similar_doc[0][0]
    return score,match


final_data  = []

# df_ws is the DataFrame of 10K docs for which I want to find the similarity to the above 20 docs
for index, row in df_ws.iterrows():
    print(row['processed_description'])
    data = (doc2vec_score(row['processed_description']))
    L1=list(data)
    L1.append(row['Number'])
    final_data.append(L1)
     
with open('file_cosine_d2v.csv','w',newline='') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['score','match','INC_NUMBER'])
    for row in final_data:
        csv_out.writerow(row)

But I am facing a strange issue: the results are highly unreliable (the score is 0.9 even when there is not the slightest match), and the score changes by a large margin every time I run the doc2vec_score function. Can someone please help me figure out what is wrong here?

taha
Shivam Agrawal
1 Answer


First and foremost, avoid the anti-pattern of calling train() multiple times in your own loop.

See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?

If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.

For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.

And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.

More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)

So you wouldn't want to use a Doc2Vec model trained on only positive examples of what you want to recognize, if you also need to recognize negative examples. Rather, train on documents of all kinds, then remember which subset is known to be relevant, and use that subset for downstream comparisons, or multiple subsets to feed a more formal classification or clustering algorithm.

gojomo