I don't have a large corpus of data for training word similarities, e.g. that 'hot' is more similar to 'warm' than to 'cold'. However, I would like to train doc2vec on a relatively small corpus (~100 docs) so that it can classify my domain-specific documents.
To elaborate, let me use a toy example. Assume I have only 4 training docs, given by 4 sentences: "I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", and "I love hot cake.". Given a test document "I adore hot chocolate", I would expect doc2vec to invariably return "I love hot chocolate." as the closest document. This expectation would hold if word2vec already supplied the knowledge that "adore" is very similar to "love". However, the most similar document I get is "I hate hot chocolate", which is bizarre!
Any suggestions on how to circumvent this, i.e. how to use pre-trained word embeddings so that I don't have to train the model myself to learn that "adore" is close to "love", "hate" is close to "detest", and so on?
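To make the expectation concrete, here is a minimal sketch of the behaviour I have in mind. It is not doc2vec: it simply averages word vectors per document and ranks by cosine similarity. The 4-dimensional vectors in `toy_embeddings` are made up by hand purely for illustration (they stand in for real pre-trained embeddings), and the helper names `doc_vector` and `cosine` are hypothetical.

```python
import math

# Hand-made toy vectors standing in for pre-trained embeddings (an assumption,
# not real data). Dimensions: [sentiment, chocolate, tea, cake].
toy_embeddings = {
    "i":         (0.0, 0.1, 0.1, 0.1),
    "hot":       (0.0, 0.3, 0.3, 0.3),
    "love":      (1.0, 0.0, 0.0, 0.0),
    "adore":     (0.9, 0.0, 0.0, 0.0),
    "hate":      (-1.0, 0.0, 0.0, 0.0),
    "chocolate": (0.0, 1.0, 0.0, 0.0),
    "tea":       (0.0, 0.0, 1.0, 0.0),
    "cake":      (0.0, 0.0, 0.0, 1.0),
}

def doc_vector(text):
    """Average the toy vectors of all known words in the text."""
    words = [w for w in text.lower().strip(".").split() if w in toy_embeddings]
    dims = len(next(iter(toy_embeddings.values())))
    return [sum(toy_embeddings[w][d] for w in words) / len(words) for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = ["I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", "I love hot cake."]
query = doc_vector("I adore hot chocolate")

# Rank the training documents by similarity to the query, most similar first.
ranked = sorted(((cosine(query, doc_vector(d)), d) for d in docs), reverse=True)
for score, doc in ranked:
    print("(score: %.4f) %s" % (score, doc))
```

With these made-up vectors, "I love hot chocolate." ranks first and "I hate hot chocolate." ranks last, because "adore" sits next to "love" and opposite "hate" on the sentiment dimension. That is the ordering I would like to get once the word-level knowledge comes from pre-trained embeddings.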
Code (Jupyter Notebook, Python 3.7, Gensim 3.8.1):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love hot chocolate.",
        "I hate hot chocolate",
        "I love hot tea.",
        "I love hot cake."]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
print(tagged_data)
#Train and save
max_epochs = 10
vec_size = 5
alpha = 0.025
model = Doc2Vec(vector_size=vec_size,  # it was size earlier
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    if epoch % 10 == 0:
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)  # It was model.iter earlier
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
print("Model Ready")
test_sentence="I adore hot chocolate"
test_data = word_tokenize(test_sentence.lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)
# to find most similar doc using tags
sims = model.docvecs.most_similar([v1])
print("\nTest: %s\n" %(test_sentence))
for indx, score in sims:
    print("\t(score: %.4f) %s" %(score, data[int(indx)]))