My data is as follows: each observation is a person, and the variables are the time spent (in minutes) doing a given activity in the early morning, late morning, afternoon, evening, and night (5 variables). I converted the minutes to proportions of each person's total, so each person's data (i.e. each row) adds up to 1.
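To make the transformation concrete, here is a minimal sketch of the row normalization, assuming NumPy. The numbers and the names `minutes` and `proportions` are made up for illustration and are not my real data.

```python
import numpy as np

# Hypothetical example: minutes per person in each of the 5 time slots
# (early morning, late morning, afternoon, evening, night).
minutes = np.array([
    [120.0, 30.0, 0.0, 15.0, 0.0],
    [0.0, 0.0, 60.0, 60.0, 0.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],
])

# Divide each row by its total so every person's values sum to 1.
row_totals = minutes.sum(axis=1, keepdims=True)
proportions = minutes / row_totals

print(proportions.sum(axis=1))
```

A person with a total of 0 minutes would cause a division by zero here, so such rows would have to be dropped or handled separately first.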
I want to group people based on their patterns of doing this activity. For example, one group could be people who do most of their work in the early morning and a little in the evening, another could be those who only work at a certain time, etc.
I have a few questions on how to go about this:
1- Since I am using proportions that add up to 1, my variables are not independent (any one of them is determined by the other four). Is this dependency a problem for clustering?
2- Is there a particular advantage of using Gaussian Mixture Models instead of KMeans here?
3- For evaluating the clustering, is 0.4 a good Silhouette score?
4- If the Silhouette score for different numbers of clusters varies from 0.4 to 0.49, can I choose a number of clusters that does not give the highest Silhouette score but gives a more balanced number of observations in each cluster (because I prefer having balanced groups)?
5- Is there a way to "toss" observations that are on the boundary of clusters, just to make clusters more dense and improve the Silhouette score?
6- Is reducing the number of variables a good idea? For example, I could merge early morning and late morning into one variable, so I would have 4 variables instead of 5. Does this usually help improve the clustering?
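To make questions 2 to 4 concrete, the kind of comparison I have in mind could be sketched as follows. This assumes scikit-learn (`KMeans`, `GaussianMixture`, `silhouette_score`); the data is synthetic (Dirichlet draws standing in for my proportions), and the `summarize` helper and the diagonal covariance choice are only illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 300 people, 5 proportions per row
# that sum to 1, drawn from three different "daily pattern" profiles.
X = np.vstack([
    rng.dirichlet([8, 2, 1, 1, 1], size=100),  # mostly early morning
    rng.dirichlet([1, 1, 8, 2, 1], size=100),  # mostly afternoon
    rng.dirichlet([1, 1, 1, 2, 8], size=100),  # mostly night
])


def summarize(labels, k):
    """Return the silhouette score (if defined) and the cluster sizes."""
    sizes = np.bincount(labels, minlength=k)
    # The silhouette score needs at least two non-empty clusters.
    score = silhouette_score(X, labels) if len(np.unique(labels)) > 1 else None
    return {"silhouette": score, "sizes": sizes}


results = {}
for k in range(2, 7):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Diagonal covariances are an assumption made here to keep the sketch simple.
    gmm_labels = GaussianMixture(
        n_components=k, covariance_type="diag", random_state=0
    ).fit_predict(X)
    results[k] = {
        "kmeans": summarize(km_labels, k),
        "gmm": summarize(gmm_labels, k),
    }

for k, res in results.items():
    print(k, res["kmeans"], res["gmm"])
```

Printing the cluster sizes next to the Silhouette score is what lets me see the trade-off in question 4 between the highest score and the most balanced groups.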
Thanks for any help!