
I have a corpus trained with Doc2Vec as follows:

from gensim.models.doc2vec import Doc2Vec

d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10)
d2vmodel.build_vocab(train_corpus)
d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs)

Using the vectors, the documents are clustered with kmeans:

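from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++', random_state=42)
# docvecs.vectors_docs is the gensim 3.x attribute; in gensim 4.x the equivalent is d2vmodel.dv.vectors
kmeans_model.fit(d2vmodel.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()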

I would like to use the k-means to cluster a new document and know which cluster it belongs to. I've tried the following but I don't think the input for predict is correct.

from numpy import array
testdocument = gensim.utils.simple_preprocess('Microsoft excel')
cluster_label = kmeans_model.predict(array(testdocument))

Any help is appreciated!

kami

1 Answer


Your kmeans_model expects a feature vector like the ones it was given during its original clustering, not the list of string tokens you get back from gensim.utils.simple_preprocess().
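To make that expectation concrete, here is a minimal pure-Python sketch of the nearest-centroid assignment that k-means prediction amounts to. It is an illustration only, not scikit-learn's implementation; the `centroids` values and the `nearest_centroid` helper are hypothetical.

```python
import math

# Hypothetical 3-dimensional centroids, standing in for a fitted model's cluster centres.
centroids = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean distance)."""
    if any(len(vector) != len(c) for c in centroids):
        raise ValueError("vector must have the same number of features as the centroids")
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


# A numeric vector with the right number of features can be assigned to a cluster.
label = nearest_centroid([0.9, 1.1, 0.8], centroids)
print(label)
```

A list of word tokens has neither numeric values nor the right length, so there is nothing to measure a distance against; the tokens have to be turned into a vector first.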

In fact, you want to use the Doc2Vec model to take such lists-of-tokens and turn them into model-compatible vectors, via its infer_vector() method. For example:

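testdoc_words = gensim.utils.simple_preprocess('Microsoft excel')
testdoc_vector = d2vmodel.infer_vector(testdoc_words)
# predict() expects a 2-D array of shape (n_samples, n_features), so reshape the single vector
cluster_label = kmeans_model.predict(testdoc_vector.reshape(1, -1))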

Note that both Doc2Vec training and inference work better on documents at least tens of words long (not tiny 2-word phrases like your test here), and that inference often benefits from a larger-than-default value for the optional epochs parameter, especially on short documents.

Note also that your test document really should be preprocessed and tokenized exactly the same way as your training data, so if some other process was used to prepare train_corpus, use that same process for post-training documents. (Words the Doc2Vec model doesn't recognize, because they weren't present during training, are silently ignored, so a mistake such as using a different style of case-flattening at inference time will weaken results a lot.)
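The effect of a case mismatch can be sketched without any model at all. In the snippet below, `known_vocab` is a hypothetical stand-in for the words a trained model knows, and `tokens_model_can_use` is an illustrative helper, not a gensim function.

```python
# Hypothetical vocabulary, standing in for the (lowercased) words a trained model knows.
known_vocab = {"microsoft", "excel", "spreadsheet"}


def tokens_model_can_use(tokens, vocab):
    """Keep only tokens present in the vocabulary; unknown ones contribute nothing."""
    return [t for t in tokens if t in vocab]


text = "Microsoft Excel spreadsheet"
same_as_training = text.lower().split()  # lowercased, like the hypothetical training data
different_style = text.split()           # case preserved, unlike the training data

kept_same = tokens_model_can_use(same_as_training, known_vocab)
kept_diff = tokens_model_can_use(different_style, known_vocab)
print(kept_same)
print(kept_diff)
```

With matching preprocessing all three tokens survive; with case preserved only one does, and nothing signals that the other two were dropped.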

gojomo
    Thanks for that, I will certainly test with the `epochs` param too. The string is only for purposes of demo, the actual documents are much longer. Small comment -- for the single sample I had to reshape the vector before passing it on to predict: `testdoc_vector.reshape(1,-1)` – kami Dec 10 '18 at 01:58