
I have multiple documents that contain multiple sentences. I want to use doc2vec to cluster (e.g. k-means) the sentence vectors by using sklearn.

As such, the idea is that similar sentences are grouped together in several clusters. However, it is not clear to me whether I have to train every single document separately and then run a clustering algorithm on the sentence vectors, or whether I can infer a sentence vector from doc2vec without training on every new sentence.

Right now this is a snippet of my code:

import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Tag each sentence so doc2vec learns one vector per sentence
sentenceLabeled = []
for sentenceID, sentence in enumerate(example_sentences):
    sentenceL = TaggedDocument(words=sentence.split(), tags=['SENT_%s' % sentenceID])
    sentenceLabeled.append(sentenceL)

model = Doc2Vec(size=300, window=10, min_count=0, workers=11,
                alpha=0.025, min_alpha=0.025)
model.build_vocab(sentenceLabeled)
for epoch in range(20):
    model.train(sentenceLabeled, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# Matrix of all trained sentence vectors, one row per sentence
textVect = model.docvecs.doctag_syn0

## K-means ##
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(textVect)
clusters = km.labels_.tolist()

## Print Sentence Clusters ##
cluster_info = {'sentence': example_sentences, 'cluster': clusters}
sentenceDF = pd.DataFrame(cluster_info, index=clusters, columns=['sentence', 'cluster'])

for num in range(num_clusters):
    print()
    print("Sentence cluster %d:" % (num + 1))
    for sentence in sentenceDF.loc[num]['sentence'].values.tolist():
        print(' %s ' % sentence)
    print()

Basically, what I am doing right now is training on every labeled sentence in the document. However, I have the feeling this could be done in a simpler way.

Eventually, sentences that contain similar words should end up clustered together and printed. At this point, training every document separately does not reveal any clear logic within the clusters.

Hopefully someone can steer me in the right direction. Thanks.

Boyos123
    I know this is a bit late, but you can try other clustering techniques such as hierarchical clustering as well as DBSCAN to see if it improves anything. If something else worked out for you, please share that as well for the benefit of everyone. Thanks! – StatguyUser Jan 17 '18 at 13:13
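
A minimal sketch of those alternatives, assuming textVect is the matrix of sentence vectors from the snippet in the question:

from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Hierarchical (agglomerative) clustering with a fixed number of clusters
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(textVect)

# DBSCAN chooses the number of clusters itself; eps needs tuning per dataset
db_labels = DBSCAN(eps=0.5, min_samples=2, metric='cosine').fit_predict(textVect)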

2 Answers

  • Have you looked at the word vectors you get (use the dm=1 algorithm setting)? Do these show good similarities when you inspect them?
  • I would try using t-SNE to reduce the dimensionality once you have some reasonable-looking word vectors. You can run PCA first to get down to, say, 50 dimensions if you need to; both are available in sklearn. Then see whether your documents form distinct groups that way (see the sketch below).
  • Also look at most_similar() on your document vectors, and try infer_vector() on a known trained sentence; you should get a similarity very close to 1 if all is well. (infer_vector() gives a slightly different result each time, so never identical!)
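
A rough sketch of those checks, reusing names from the question's snippet (textVect, model, clusters, example_sentences); the PCA step assumes there are at least 50 sentence vectors:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA down to ~50 dimensions first, then t-SNE down to 2-D for plotting
reduced = PCA(n_components=50).fit_transform(textVect)
embedded = TSNE(n_components=2).fit_transform(reduced)

# Colour each point by its k-means cluster to see whether groups separate
plt.scatter(embedded[:, 0], embedded[:, 1], c=clusters)
plt.show()

# Sanity check: re-inferring a trained sentence should rank its own
# stored vector near the top, with similarity close to 1
inferred = model.infer_vector(example_sentences[0].split())
print(model.docvecs.most_similar(positive=[inferred], topn=3))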
Luke Barker
  • Yes, I have looked at the word vectors as well. Alternatively, I have trained a model on all the documents, as suggested by thn, and used infer_vector(), which shows more promising results. Furthermore, I am considering LDA to infer topics that could be used as clusters. Will also try to incorporate PCA for dimensionality reduction! – Boyos123 Apr 19 '17 at 17:21

For your use case, all sentences in all documents should be trained together. Essentially, you should treat each sentence as a mini-document. They will then all share the same vocabulary and semantic space.
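
A minimal sketch of that setup, assuming documents is a list of documents where each document is a list of sentence strings (the variable names here are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One combined corpus: every sentence of every document becomes a
# TaggedDocument with a unique tag
corpus = []
for doc_id, doc in enumerate(documents):
    for sent_id, sentence in enumerate(doc):
        tag = 'DOC_%s_SENT_%s' % (doc_id, sent_id)
        corpus.append(TaggedDocument(words=sentence.split(), tags=[tag]))

# A single model trained on everything, so all sentence vectors share
# one vocabulary and semantic space
model = Doc2Vec(size=300, window=10, min_count=2, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=20)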

THN
  • So, this basically means that for every document I have to train a new doc2vec model and use that specific model for clustering? Doing this does not cluster the sentences logically, in my opinion, since I can't see semantic differences between the clusters. – Boyos123 Apr 19 '17 at 10:47
  • No, you don't train a new doc2vec model for every document. You train one doc2vec model on all sentences in all documents. How is that not clear from my answer? – THN Apr 19 '17 at 13:01
  • You would need to use all sentences to train one model, as thn says. Then for new **unseen** sentences you can infer_vector() using your model and work out which cluster each belongs to with your sklearn clustering. – Luke Barker Apr 19 '17 at 17:04
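
A small sketch of that last step, reusing model and the fitted km from the question (the new sentence here is made up):

# Infer a vector for an unseen sentence without retraining, then ask
# the already-fitted KMeans which cluster it falls into
new_sentence = "some new unseen sentence"
new_vec = model.infer_vector(new_sentence.split())
cluster_id = km.predict([new_vec])[0]
print("New sentence assigned to cluster %d" % cluster_id)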