When I train Doc2vec (using Gensim's Doc2vec in Python) on corpus of about 10k documents (each has few hundred words) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect they would be at least somewhat similar.
That is I do model.docvecs['some_doc_id']
and model.infer_vector(documents['some_doc_id'])
.
Cosine distances between trained and inferred vectors for few first documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Note on the side: Documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
Also I played with parameters around a bit, too, and results were similar. So if your suggestion would be something like "try increasing or decreasing this or that training parameter", I've most likely tried it. Maybe I just didn't come across the 'correct' parameters though.
Thanks for any suggestions as to what can I do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec). It doesn't have to be this one. If you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence
files = {}
folder = 'some path' # each file contains few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()
documents = []
for k, v in files.items():
words = text_to_word_sequence(v, lower=True) # converts string to list of words, removes commas etc.
documents.append(TaggedDocument(tags=[k], words=words))
d2 = Doc2Vec(size=200, documents=documents)
for doc in documents:
trained = d2.docvecs[doc.tags[0]]
inferred = d2.infer_vector(doc.words, steps=50)
print(cosine(trained, inferred)) # cosine similarity from scipy