
I have time-series data, which I have aggregated into 3 weeks and transposed into features.

Now I have the features A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on. Some of the features are discrete, others continuous.

I am thinking of applying K-Means or DBSCAN.

How should I approach feature selection in such a situation? Should I normalise the features? Should I introduce some new ones that would somehow link the periods together?

katyapush

2 Answers


Since K-means and DBSCAN are unsupervised learning algorithms, feature selection for them is typically tied to a grid search: try different feature sets and parameters, and evaluate the resulting clusterings with internal measures such as the Davies–Bouldin index or the Silhouette coefficient, among others. If you're using Python, you can run an exhaustive grid search; see the scikit-learn documentation.
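
For illustration, here is a minimal sketch of such a search (assuming scikit-learn; X is a random placeholder for your week-by-feature matrix, and the candidate cluster counts are arbitrary), scored with the Silhouette coefficient and the Davies–Bouldin index:

    # Grid search over K-means settings, evaluated with internal measures.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score
    from sklearn.model_selection import ParameterGrid
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 9))          # placeholder for your A_week1 ... C_week3 matrix
    X = StandardScaler().fit_transform(X)  # K-means is sensitive to feature scale

    results = []
    for params in ParameterGrid({"n_clusters": [2, 3, 4, 5, 6]}):
        labels = KMeans(n_init=10, random_state=0, **params).fit_predict(X)
        results.append((params["n_clusters"],
                        silhouette_score(X, labels),       # higher is better
                        davies_bouldin_score(X, labels)))  # lower is better

    for k, sil, db in results:
        print(f"k={k}  silhouette={sil:.3f}  davies-bouldin={db:.3f}")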

Bitzel

Formalize your problem, don't just hack some code.

K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to choose the weights (scaling factors) of each variable carefully to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important, because the distances are squared).
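
As a minimal sketch of that weighting (the weights here are made up, not a recommendation): standardize the columns first, then multiply each column by its weight before clustering; a column scaled by w contributes w² to the sum of squares.

    # Weight variables for K-means by rescaling columns before clustering.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))             # e.g. A_week1, B_week1, C_week1
    X = StandardScaler().fit_transform(X)     # put all columns on a common scale first

    weights = np.array([1.0, 2.0, 0.5])       # chosen by you, not learned from the data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X * weights)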

For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1 and scale your variables such that 1 is a "too big" difference. For example, for discrete variables you may want to tolerate one or two discrete steps, but not three.
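
A minimal sketch of this recipe (the per-variable tolerances below are hypothetical), using scikit-learn's DBSCAN with the Chebyshev (maximum) metric:

    # Scale each variable by the largest difference you still consider "close",
    # then cluster with eps=1 under the maximum (Chebyshev) norm.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))            # mixed discrete/continuous columns
    tolerances = np.array([2.0, 0.5, 1.0])   # per-variable "too big" difference
    X_scaled = X / tolerances

    labels = DBSCAN(eps=1.0, min_samples=5, metric="chebyshev").fit_predict(X_scaled)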

Logically, it's easy to see that the maximum-distance threshold decomposes into a conjunction of one-variable clauses:

    maxdistance(x, y) <= eps
    <=>
    for all i: |x_i - y_i| <= eps
Has QUIT--Anony-Mousse