
I am clustering a set of education documents using doc2vec.

As a human, I think of these as in categories such as:

  • computer-related
  • language-related
  • collaboration
  • arts

etc.

I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.

One strategy I have been trying is to filter out all 'nonsense' words and train doc2vec only on the words that seem meaningful. But of course, this risks degrading the training.

Something just occurred to me that might work:

  • Train on entire documents (don't filter out words) to create doc2vec space

  • Filter nonsense words out of each document ('help', 'student', and other words that carry very little meaning in this space)

  • Project filtered documents into doc2vec space

  • then cluster the projected vectors using k-means etc. (see the sketch below)
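
Concretely, a minimal sketch of this pipeline, assuming gensim's Doc2Vec class and scikit-learn's KMeans (`docs` is a placeholder list of tokenized documents, and the stoplist is hypothetical):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # `docs` is a placeholder: a list of token lists, one per document
    tagged_docs = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(docs)]

    # Step 1: train on entire, unfiltered documents
    model = Doc2Vec(tagged_docs, vector_size=100, min_count=2, epochs=20)

    # Steps 2-3: filter 'nonsense' words, then project filtered docs into the trained space
    NONSENSE = {'help', 'student'}  # hypothetical stoplist
    filtered = [[w for w in toks if w not in NONSENSE] for toks in docs]
    projected = [model.infer_vector(toks) for toks in filtered]

    # Step 4: cluster the projected vectors
    labels = KMeans(n_clusters=4, random_state=0).fit_predict(projected)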

I would appreciate any constructive suggestions or next steps.

best

1 Answer

Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.

Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
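
As one concrete illustration of that automatic deemphasis: gensim's Doc2Vec inherits Word2Vec's sample parameter, which randomly downsamples very frequent words during training, so near-ubiquitous words already get less weight without any manual filtering. A minimal sketch, reusing the tagged_docs placeholder from above (the value is illustrative, not a recommendation):

    from gensim.models.doc2vec import Doc2Vec

    # `sample` randomly discards occurrences of words whose corpus frequency
    # exceeds the threshold, so near-ubiquitous words like 'student' contribute
    # fewer training examples; 1e-4 is illustrative (gensim's default is 1e-3)
    model = Doc2Vec(tagged_docs, vector_size=100, epochs=20, sample=1e-4)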

I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
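
A minimal sketch of that idea, assuming a gensim 4.x model trained in a word-training mode as above (so doc-vectors live in model.dv and word-vectors in model.wv) and scikit-learn's KMeans:

    import numpy as np
    from sklearn.cluster import KMeans

    # collect the trained doc-vectors (integer tags, as in the question's sketch)
    doc_vecs = np.array([model.dv[i] for i in range(len(tagged_docs))])
    kmeans = KMeans(n_clusters=4, random_state=0).fit(doc_vecs)

    # describe each cluster by the word-vectors nearest its centroid
    for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
        nearest = model.wv.similar_by_vector(centroid, topn=10)
        print(cluster_id, [word for word, _ in nearest])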

(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be sparse with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)

You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)
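
A minimal LDA sketch along those lines, assuming gensim's LdaModel and the same tokenized `docs` placeholder (the topic count is illustrative):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus, num_topics=8, id2word=dictionary, passes=10)

    # naive clustering: assign each document to its single strongest topic
    for i, bow in enumerate(corpus):
        topic_id, prob = max(lda.get_document_topics(bow), key=lambda t: t[1])
        print(i, topic_id, lda.show_topic(topic_id, topn=5))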

gojomo
  • Thanks so much, I don't quite know how to do all of those things, though I am somewhat familiar with them. Sent you a PM. – user7400474 Apr 16 '18 at 18:11