
I am using doc2vec to convert the top 100 tweets of my followers into vector representations (say v1…v100). After that I am using these vector representations to do K-Means clustering.

model = Doc2Vec(documents=t, size=100, alpha=.035, window=10, workers=4, min_count=2)
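The clustering step I then run on these vectors looks roughly like the sketch below. It is a minimal pure-Python stand-in for K-Means (Lloyd's algorithm) on toy 2-d vectors; the real v1…v100 are 100-dimensional and come from the trained Doc2Vec model, and in practice one would use sklearn.cluster.KMeans instead:

```python
def kmeans(vectors, k, iters=10):
    """Toy Lloyd's-algorithm k-means on plain Python lists."""
    # Naive init: first k vectors as centroids (real code would use k-means++).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to the nearest centroid
        # by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Six toy 2-d "tweet vectors" in two obvious groups, standing in for the
# Doc2Vec vectors; each vector ends up with a cluster label.
docs = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0], [5.2, 4.8]]
labels = kmeans(docs, k=2)
```
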

I can see that cluster 0 is dominated by some values (say v10, v12, v23, …). My question is: what do these v10, v12, etc. represent? Can I deduce that these specific columns cluster specific keywords of the documents?

pankaj jha

3 Answers


Don't use the individual variables. They should only be analyzed together, because of the way these embeddings are trained.

For a start:

  1. Find the document vectors most similar to your centroid, to see typical cluster members.
  2. Find the most similar term vectors from the embedding, to get typical words that describe the cluster.
  3. Note the distances, to see how good your fit is.
Has QUIT--Anony-Mousse
  • Thanks, this is really a great approach. I will try it and let you know the results. My results are showing some good trends: companies/firms who tweet just for promotion of their products are getting clustered together, and people using words like "arts" etc. cluster in one place – pankaj jha Aug 28 '17 at 18:41
  • What do you mean by most similar term vector? I have done step 1 and identified the k nearest neighbors to the centroid. Should I do a word count / tf-idf to find the most similar term vectors, or use word2vec over each cluster to find the same? – pankaj jha Sep 04 '17 at 08:21
  • Use the same computation (dot product) that doc2vec uses. – Has QUIT--Anony-Mousse Sep 04 '17 at 19:29

The clusters themselves do not mean anything specific. You can have as many clusters as you want, and all the clustering algorithm will do is try to distribute your vectors among those clusters. If you are aware of all the tweets and know how many different topics you want them separated into, try to clean them or engineer features in them such that the clustering algorithm can use those to segregate them into the clusters of your choice.

Also if you meant topic modeling, that is different from clustering and you should also look that up.

Devaraj Phukan
  • No, only a few algorithms such as k-means will distribute all points to k clusters. Quite a lot of the modern algorithms don't. And even with k-means, the clusters do have some meaning. It's just not easy to map back through word2vec into the original data space. – Has QUIT--Anony-Mousse Aug 28 '17 at 18:25
  • All I want is to segment similar followers using the content of their tweets. There are ways to find the optimal number of clusters in data, so I don't agree that clustering is totally useless. I am just experimenting to see if doc2vec can do better segmentation, and of course it should do some kind of segmentation of the topics users are interested in. – pankaj jha Aug 28 '17 at 18:28

These values represent the individual tweets (or documents) that you want to place in a cluster. I am assuming that v1 to v100 represent the vectors for tweets 1 to 100; otherwise this won't make sense. So if, say, cluster 0 has v1, v5 and v6, this means that tweets 1, 5 and 6, with vector representations v1, v5 and v6 respectively (or the tweets with vectors v1, v5 and v6 as their representation), belong to cluster 0.
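Under that reading, going from cluster labels back to tweets is just inverting the label mapping. A small sketch, using made-up tweet ids and labels (in practice the labels would come from something like KMeans.fit_predict over the Doc2Vec vectors):

```python
from collections import defaultdict

# Hypothetical cluster assignments: vector v_i of tweet i -> cluster id,
# matching the example where cluster 0 contains v1, v5 and v6.
labels = {1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0}

# Invert the mapping: cluster id -> tweets whose vectors fell in that cluster.
clusters = defaultdict(list)
for tweet_id, cluster_id in labels.items():
    clusters[cluster_id].append(tweet_id)
```
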

Gambit1614
  • You really should read up on word2vec. No, the variables don't correspond to tweets. – Has QUIT--Anony-Mousse Aug 28 '17 at 18:23
  • @Anony-Mousse I am using doc2vec. In the case of word2vec I can try to make sense of clusters of words, but defining two documents of 100 tweets is very difficult, as users are tweeting on diverse topics. But my results are showing some good trends: companies/firms who tweet just for promotion of their products are getting clustered together, and people with words like "arts" cluster in one place. But how to use these variables to define the properties of a cluster is the big question – pankaj jha Aug 28 '17 at 18:35