Clustering: Is it a problem if factors are not independent? How to evaluate the model?

Question

My data is as follows: each observation is a person, and the variables are the time spent (in minutes) doing a given activity in the early morning, late morning, afternoon, evening, and night (5 variables). I converted the time spent to a proportion of each person's total, so each person's data (i.e. each row) adds up to 1.
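As a rough illustration of the conversion described above, here is a minimal sketch in plain Python. The sample rows and the function name `to_proportions` are hypothetical, and returning `None` for a person with zero total minutes is an assumption, not something stated in the question.

```python
def to_proportions(minutes):
    """Scale one person's minutes per time slot so the values sum to 1."""
    total = sum(minutes)
    if total == 0:
        # Assumption: a person with no recorded activity has no pattern
        # to normalize, so flag the row instead of dividing by zero.
        return None
    return [m / total for m in minutes]


# Hypothetical rows: early morning, late morning, afternoon, evening, night.
people = [
    [60, 0, 0, 20, 0],
    [0, 0, 45, 45, 0],
]
proportions = [to_proportions(row) for row in people]
```

Each resulting row sums to 1, which is exactly why the five variables are not independent: any one of them can be recovered from the other four.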

I want to group people based on their patterns of doing this activity. For example, one group could be people who do most of their work in the early morning and a little in the evening, another could be those who only work at a certain time, etc.

I have a few questions on how to go about this:

1- Since I am using percentages that add up to 1, I think my variables are not independent. Is the dependency a problem for clustering?

2- Is there a particular advantage of using Gaussian Mixture Models instead of KMeans here?

3- For evaluating the clustering, is 0.4 a good Silhouette score?

4- If the Silhouette score for different numbers of clusters varies from 0.4 to 0.49, can I choose a number of clusters that does not give the highest Silhouette score but gives a more balanced number of observations in each cluster (because I prefer having balanced classes)?

5- Is there a way to "toss" observations that are on the boundary of clusters, just to make clusters more dense and improve the Silhouette score?

6- Is reducing the number of variables a good idea? For example, I could merge early morning and late morning into one variable, so I would have 4 factors instead of 5. Does this usually help improve the clustering?

Thanks for any help!

score 1 · Answer 1 · answered Jun 09 '20 at 21:32


1- No. However, fewer dimensions are generally better than many, so why not simply drop your last variable? Because each row sums to 1, it is determined by the other four, and dropping it reduces the number of dimensions by 1.

2- Not in general.

3- The documentation gives a pretty good idea of how to use the Silhouette score.

4- See above.

5- Seems like a very poor idea.

6- In general, no. To take an extreme example, lumping ALL the observations together will not give a useful clustering, though it WILL give a very tight cluster. However, hierarchical clustering (which you can look up) addresses this problem.
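To make the suggestion of dropping the last variable and the Silhouette score more concrete, here is a minimal sketch in plain Python. It is not the answerer's code: the data rows and the cluster labels are made up, and the `silhouette` function is a hand-written version of the textbook definition. In practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead.

```python
from collections import Counter
from math import dist


def silhouette(points, labels):
    """Mean silhouette over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical proportions: early morning, late morning, afternoon,
# evening, night. Each row sums to 1.
rows = [
    [0.8, 0.1, 0.0, 0.1, 0.0],
    [0.7, 0.2, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.1, 0.8],
]

# Drop the last column; it can be recovered as 1 minus the sum of the rest.
reduced = [row[:-1] for row in rows]
recovered = [1 - sum(r) for r in reduced]

# Hypothetical labels, standing in for the output of any clustering method.
labels = [0, 0, 1, 1, 2]
score = silhouette(reduced, labels)
sizes = Counter(labels)  # cluster sizes, to compare balance across runs
```

Repeating this for several numbers of clusters and looking at `score` and `sizes` side by side is one way to weigh the Silhouette score against cluster balance. Note that dropping a column changes the Euclidean distances, so scores computed on four and on five columns need not agree.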

Igor Rivin

Thank you so much! This is very helpful. Could you please elaborate on why #5 is a bad idea? If the goal is to find daily work patterns, can't we say something like "90% of users follow one of 3 distinct patterns, which we will analyze to compare their characteristics. The remaining 10% of users had different patterns and could not be classified into these 3 groups"? – Fate Jun 10 '20 at 13:00
