Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec
final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec
class in Python gensim
. Some PV-Doc2Vec modes, including the default PV-DM (dm=1
) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1
), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be spares with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)