0

I am comparing techniques and want to find out what is the best method to vector and reduce dimensions of a large number of text documents. I have already tested Bag of Words and TF-IDF and reduced dimensions with PCA, SVD, and NMF. Using these approaches I can reduce my data and know the best number of dimensions based on the variance explained.

However, I want to do the same with doc2vec, considering that doc2vec itself is a dimensional reducer, what is the best approach to find out the number of dimensions for my model? Is there any statistical measure that helps me find the best number of vector_size?

Thanks in advance!

bcosta
  • 81
  • 10

1 Answers1

0

There's no magic indicator for what's best; you should try a range of dimensionalities to see what scores well on your specific downstream evaluations, given your data & goals.

If using a doc2vec implementation that offers inference of out-of-training set documents (such as via the .infer_vector() method in Python gensim library), then a plausible sanity check for eliminating very-bad choices of vector_size (or other parameters) is to re-infer vectors for training-set documents.

If repeated re-inferences of the same text are are generally "close to" each other, and to the vector for that same document created by the full model training, that's a weak indicator that the model is at least behaving in a self-consistent way. (If the spread of results is large, that might indicate potential problems with insufficient data, too few training epochs, a too-large/overfit model, or other foundational issues.)

gojomo
  • 52,260
  • 14
  • 86
  • 115