I have a corpus trained with Doc2Vec as follows:
d2vmodel = Doc2Vec(vector_size=100, min_count=5, epochs=10)
d2vmodel.build_vocab(train_corpus)
d2vmodel.train(train_corpus, total_examples=d2vmodel.corpus_count, epochs=d2vmodel.epochs)
Using the vectors, the documents are clustered with kmeans
:
kmeans_model = KMeans(n_clusters=NUM_CLUSTERS, init='k-means++', random_state = 42)
X = kmeans_model.fit(d2vmodel.docvecs.vectors_docs)
labels=kmeans_model.labels_.tolist()
I would like to use the k-means to cluster a new document and know which cluster it belongs to. I've tried the following but I don't think the input for predict is correct.
from numpy import array
testdocument = gensim.utils.simple_preprocess('Microsoft excel')
cluster_label = kmeans_model.predict(array(testdocument))
Any help is appreciated!