
I have time-series data, which I have aggregated into 3 weeks and transposed into features.

Now I have the features A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on. Some of the features are discrete, others continuous.

I am thinking of applying K-Means or DBSCAN.

How should I approach feature selection in such a situation? Should I normalise the features? Should I introduce some new ones that would somehow link the periods together?

katyapush

2 Answers


Since K-means and DBSCAN are unsupervised learning algorithms, feature selection for them is typically tied to a grid search: try different feature sets and parameters, and evaluate the resulting clusterings with internal measures such as the Davies–Bouldin index or the Silhouette coefficient, among others. If you're using Python, you can run an exhaustive grid search; see the scikit-learn documentation.
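
For illustration, here is a minimal sketch of such a search (assuming scikit-learn; X is a random placeholder for your week-by-feature matrix, and the candidate cluster counts are arbitrary), scored with the Silhouette coefficient and the Davies–Bouldin index:

    # Grid search over K-means settings, evaluated with internal measures.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score
    from sklearn.model_selection import ParameterGrid
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 9))          # placeholder for your A_week1 ... C_week3 matrix
    X = StandardScaler().fit_transform(X)  # K-means is sensitive to feature scale

    results = []
    for params in ParameterGrid({"n_clusters": [2, 3, 4, 5, 6]}):
        labels = KMeans(n_init=10, random_state=0, **params).fit_predict(X)
        results.append((params["n_clusters"],
                        silhouette_score(X, labels),       # higher is better
                        davies_bouldin_score(X, labels)))  # lower is better

    for k, sil, db in results:
        print(f"k={k}  silhouette={sil:.3f}  davies-bouldin={db:.3f}")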

Bitzel

Formalize your problem, don't just hack some code.

K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to choose the weights (scaling factors) of each variable carefully to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important, because the distances are squared).
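
As a minimal sketch of that weighting (the weights here are made up, not a recommendation): standardize the columns first, then multiply each column by its weight before clustering; a column scaled by w contributes w² to the sum of squares.

    # Weight variables for K-means by rescaling columns before clustering.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))             # e.g. A_week1, B_week1, C_week1
    X = StandardScaler().fit_transform(X)     # put all columns on a common scale first

    weights = np.array([1.0, 2.0, 0.5])       # chosen by you, not learned from the data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X * weights)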

For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1 and scale your variables such that 1 is a "too big" difference. For example, for discrete variables you may want to tolerate one or two discrete steps, but not three.
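
A minimal sketch of this recipe (the per-variable tolerances below are hypothetical), using scikit-learn's DBSCAN with the Chebyshev (maximum) metric:

    # Scale each variable by the largest difference you still consider "close",
    # then cluster with eps=1 under the maximum (Chebyshev) norm.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))            # mixed discrete/continuous columns
    tolerances = np.array([2.0, 0.5, 1.0])   # per-variable "too big" difference
    X_scaled = X / tolerances

    labels = DBSCAN(eps=1.0, min_samples=5, metric="chebyshev").fit_predict(X_scaled)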

Logically, it's easy to see that the maximum-distance threshold decomposes into a conjunction of one-variable clauses:

    maxdistance(x, y) <= eps
    <=>
    for all i: |x_i - y_i| <= eps
Has QUIT--Anony-Mousse