Note that many word2vec/doc2vec projects don't apply word-stemming (converting words to their roots) or remove stop words. With an adequately large training corpus, neither step is strictly necessary.
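For instance, a bare-bones tokenization pass is often all the preprocessing such projects need. The sketch below uses gensim's simple_preprocess, which lowercases and splits text into word tokens but does no stemming and no stop-word removal; the choice of tokenizer is a judgment call, not something Doc2Vec itself requires:

```python
from gensim.utils import simple_preprocess

# Minimal preprocessing: lowercase & split into word tokens.
# Note there's no stemming and no stop-word removal here.
raw = "The quick brown foxes were jumping over the lazy dogs."
tokens = simple_preprocess(raw)
print(tokens)
# ['the', 'quick', 'brown', 'foxes', 'were', 'jumping', 'over', 'the', 'lazy', 'dogs']
```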
You seem to be at a very rudimentary starting point, so you should work through online examples of Doc2Vec (and, more generally, "topic modeling"). Several Jupyter notebooks demonstrating both basic and more advanced uses of Doc2Vec are included with gensim, in the installation's docs/notebooks directory. You can also view them online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/
- doc2vec-lee.ipynb: very simple example of usage on toy-sized data
- doc2vec-IMDB.ipynb: more advanced example based on a movie-reviews experiment included in the original "Paragraph Vector" (Doc2Vec) research paper
- doc2vec-wikipedia.ipynb: much larger & longer-running model using millions of Wikipedia articles
Though you can browse these online, you can and should run them locally, step by step, as a learning exercise, then tinker with them as an exploration, before finally using them (and other sources) as guides for approaching your own problem.
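As a rough sketch of the shape such code eventually takes (the corpus and parameter values here are placeholder assumptions, not recommendations, and this assumes a recent gensim where the dimensionality parameter is named vector_size):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Each training document becomes a TaggedDocument: a token list plus a unique tag.
raw_docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on tuesday",
]
corpus = [TaggedDocument(words=simple_preprocess(d), tags=[i])
          for i, d in enumerate(raw_docs)]

# Tiny illustrative parameters; real projects need far more data & tuning.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new, unseen document.
new_vec = model.infer_vector(simple_preprocess("a cat on a mat"))
print(new_vec.shape)  # (50,)
```

The notebooks above walk through the same cycle (build a tagged corpus, train, then infer and compare vectors) on progressively larger datasets, which is why running them locally is the best way to see what matters at each scale.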