
My data is as follows: each observation is a person, and the variables are the time spent (in minutes) doing a given activity in the early morning, late morning, afternoon, evening, and night (5 variables). I converted the time spent to proportions of each person's total, so each person's data (i.e. each row) adds up to 1.

I want to group people based on their patterns of doing this activity. For example, one group could be people who do most of their work in the early morning and a little in the evening, another could be those who only work at a certain time, etc.

I have a few questions on how to go about this:

1- Since I am using percentages that add up to 1, I think my variables are not independent. Is the dependency a problem for clustering?

2- Is there a particular advantage of using Gaussian Mixture Models instead of KMeans here?

3- For evaluating the clustering, is 0.4 a good Silhouette score?

4- If the Silhouette score for different numbers of clusters varies from 0.4 to 0.49, can I choose a number of clusters that does not give the highest Silhouette score but gives a more balanced number of observations in each cluster (because I prefer having balanced classes)?

5- Is there a way to "toss" observations that are on the boundary of clusters, just to make clusters more dense and improve the Silhouette score?

6- Is reducing the number of variables a good idea? For example, I could merge early morning and late morning into one variable, so I would have 4 variables instead of 5. Does this usually help improve the clustering?

Thanks for any help!

Fate

1 Answer

  1. No. However, fewer dimensions are generally better than many, so why not just drop your last variable? Since each row sums to 1, it is fully determined by the other four, so this reduces the number of dimensions by 1 without losing information.
  2. Not in general.
  3. The documentation gives a pretty good idea of how to use the Silhouette score.
  4. See above.
  5. Seems like a very poor idea.
  6. In general, no. To take an extreme example, lumping ALL the observations together will not give a useful clustering (though it WILL give a very tight cluster). However, hierarchical clustering (which you can look up) addresses this problem.
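As a rough illustration of points 1, 3 and 4, here is a minimal sketch in plain Python. The minute values and the two candidate labelings are made up for the example (in practice the labels would come from KMeans or a similar algorithm), and the helper names `to_proportions` and `silhouette_values` are hypothetical. It drops the redundant last column and then reports the mean Silhouette score next to the cluster sizes, so that both can be weighed when picking the number of clusters.

```python
import math
from collections import Counter

def to_proportions(minutes):
    """Scale one person's minutes so that the row sums to 1."""
    total = sum(minutes)
    return [m / total for m in minutes]

def silhouette_values(points, labels):
    """Per-observation silhouette s = (b - a) / max(a, b), Euclidean distance."""
    values = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], [])
        if not own or not dists:
            # Singleton cluster (or only one cluster overall): this sketch uses 0.
            values.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

# Hypothetical data: minutes per person in the five time slots.
minutes = [
    [120, 30, 0, 10, 0],    # mostly early morning
    [100, 40, 10, 0, 0],
    [110, 20, 0, 20, 0],
    [0, 0, 20, 100, 40],    # mostly evening
    [10, 0, 10, 120, 30],
    [0, 10, 30, 90, 50],
]
props = [to_proportions(row) for row in minutes]

# Each row sums to 1, so the last column equals 1 minus the others and can be dropped.
reduced = [row[:-1] for row in props]

# Hypothetical candidate labelings for two different numbers of clusters.
candidates = {
    "2 clusters": [0, 0, 0, 1, 1, 1],
    "3 clusters": [0, 0, 2, 1, 1, 1],
}

report = {}
for name, labels in candidates.items():
    sil = silhouette_values(reduced, labels)
    report[name] = (sum(sil) / len(sil), Counter(labels))
    print(name, "mean silhouette:", round(report[name][0], 3),
          "sizes:", dict(report[name][1]))
```

Printing the sizes alongside the score makes the trade-off in question 4 visible: a solution with a slightly lower score but more even cluster sizes can be compared directly with the top-scoring one.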
Igor Rivin
  • Thank you so much! This is very helpful. Could you please elaborate on why #5 is a bad idea? If the goal is to find daily work patterns, can't we say something like "90% of users follow one of 3 distinct patterns, which we will analyze to compare their characteristics. The remaining 10% of users had different patterns and could not be classified into these 3 groups"? – Fate Jun 10 '20 at 13:00
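The kind of report described in this comment could be sketched as follows. This is only one possible reading of "on the boundary", and the centroids, the data, the `margin_flags` helper and the `min_ratio` threshold are all hypothetical. A person is marked as unclassified when the second-nearest cluster centre is less than `min_ratio` times as far away as the nearest one. The sketch does not address whether leaving such people out is statistically sound.

```python
import math

def margin_flags(points, centroids, min_ratio=1.5):
    """Return a (label, is_boundary) pair for each point.

    A point counts as boundary when its second-nearest centroid is less
    than `min_ratio` times as far away as its nearest centroid.
    Requires at least two centroids.
    """
    result = []
    for p in points:
        # Sort centroids by distance to this point; keep the two closest.
        dists = sorted((math.dist(p, c), k) for k, c in enumerate(centroids))
        (d1, label), (d2, _) = dists[0], dists[1]
        result.append((label, d2 < min_ratio * d1))
    return result

# Hypothetical cluster centres (rows sum to 1, like the data).
centroids = [
    [0.7, 0.2, 0.0, 0.1, 0.0],   # mostly early morning
    [0.0, 0.0, 0.1, 0.7, 0.2],   # mostly evening
    [0.2, 0.2, 0.2, 0.2, 0.2],   # spread evenly over the day
]

# Hypothetical people, as proportions of their total time.
people = [
    [0.75, 0.15, 0.00, 0.10, 0.00],
    [0.05, 0.00, 0.05, 0.70, 0.20],
    [0.20, 0.25, 0.15, 0.20, 0.20],
    [0.50, 0.20, 0.10, 0.10, 0.10],  # sits between the first and third pattern
]

flags = margin_flags(people, centroids)
classified_share = sum(1 for _, boundary in flags if not boundary) / len(flags)
print(flags)
print("share classified:", classified_share)
```

The threshold is arbitrary, and the reported share of "classified" users moves with it, so any sentence like "90% of users follow one of 3 patterns" would need to state how the cut-off was chosen.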