It appears you might be using gensim `Doc2Vec`, but out of the box it doesn't support a `pretrained_emb` argument, and using pretrained word embeddings isn't necessarily a benefit for most applications, especially if you have adequate training documents. Also, you wouldn't normally be able to use 100-dimensional word-vectors from somewhere else to help seed a 300-dimensional `Doc2Vec` model. (I'm surprised whatever code you're using doesn't error on this mismatch.)
Typical published work using this algorithm ('Paragraph Vector') uses 10, 20 or more training passes, but (again assuming you're using gensim) you've left it at the default value of just 5.
Lowering `min_count` to a non-default value of 1 usually makes results worse, as words with so few occurrences mostly serve as noise, making the learned vectors for other documents/words less consistent.
Which two sentences are you comparing, and how?
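The "how" matters here: the usual comparison is cosine similarity between the two document vectors (gensim's own `similarity()` helpers use the same measure). A self-contained sketch of that calculation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors:
    1.0 means identical direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical direction, ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal, ~0.0
```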
Since the algorithm itself uses randomized initialization, then several forms of random sampling during training, and multi-threaded training adds further randomization to the order of text processing, running `Doc2Vec` on the exact same corpus repeatedly won't usually produce identical results.
Having the same text appear twice in the training set, with different `tags`, won't necessarily yield the same vector, though the two vectors should be similar. They should generally become more similar with more training passes, but smaller documents may show more variance from text to text, or run to run, because with fewer target words they get adjusted by the model-in-progress fewer times. (The same applies when inferring vectors for the same text repeatedly after training, though adjusting `infer_vector()`'s `steps` or `alpha` values might make results more stable from run to run.)