Feature scaling (normalization) for clustering algorithms (as Kmeans & EM)

Question

I want to use the KMeans clustering algorithm to analyze some profile data. The sample data has this format:

Features: name   ISBN     Date             ID      price ....
          'A'   '31NDB'  '05/18/2014'    'CBDDN'   12.00
          'B'   '3241B'  '08/19/2012'    'ABCDE'   33.08

These are just examples; the real data is not necessarily in this format. But if I need to apply a clustering algorithm to this data set, how do I do the feature scaling (a.k.a. normalization) part? How should I treat the string values, the date values, and the price (double) values? Is there a relationship between these values? I'm confused...

Any idea?

score 1 · Answer 1 · answered Oct 31 '14 at 02:58

K-means and EM are for numeric data only.

It does not make much sense to apply them to data with name/date/price-typed columns.

As the name indicates, the algorithm needs to compute means. How would you compute a mean in your "name" column? You can hack something for the date, but not for the name.

Wrong tool for your job.
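The date "hack" mentioned above could look like the following minimal sketch. It assumes the dates are in `MM/DD/YYYY` format as in the question's sample; the helper name `date_to_ordinal` is hypothetical. Once a date is a day count, a mean is well defined, which is not true for a free-text name.

```python
from datetime import date, datetime

def date_to_ordinal(text, fmt="%m/%d/%Y"):
    """Parse a date string and return it as a day count (proleptic Gregorian ordinal)."""
    return datetime.strptime(text, fmt).date().toordinal()

# The two dates from the question's sample rows.
dates = ["05/18/2014", "08/19/2012"]
ordinals = [date_to_ordinal(d) for d in dates]

# With numbers in hand, a mean can be computed and mapped back to a calendar date.
mean_ordinal = round(sum(ordinals) / len(ordinals))
mean_date = date.fromordinal(mean_ordinal)
print(ordinals, mean_date)
```

Whether a day count is a meaningful feature for the clustering is a separate question; this only shows that the arithmetic is possible.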

Has QUIT--Anony-Mousse

Then what should be the tool? For non-numeric data? Suppose I want to group similar books together? Or suppose I'm analyzing server log files.... – JudyJiang Oct 31 '14 at 11:24
Use e.g. topic modeling, which is meant to work on sparse textual data with overlapping features, based on the presence and absence of words. – Has QUIT--Anony-Mousse Oct 31 '14 at 17:30

score 0 · Answer 2 · edited May 23 '17 at 11:57

You will have to encode the non-numeric features as numbers. This is the case for categorical or ordinal features.

Also, if certain features are unimportant to your analysis, consider throwing them away. For example, if you are trying to cluster books, the purchase date might not be important (or it might be, depending on what you are concerned with), in which case adding the date won't make sense.

As an example, a variable with 3 categories could be encoded as 3 variables, [1, 0, 0], [0, 1, 0], [0, 0, 1], or as 2 variables, [0, 0], [1, 0], [0, 1]. There is a bit more discussion on this here.
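The two encodings above can be sketched in plain Python as follows. The function names `one_hot` and `dummy` and the category labels are hypothetical, chosen only for illustration; libraries such as pandas or scikit-learn offer their own encoders.

```python
def one_hot(value, categories):
    """Encode a category as one 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]

def dummy(value, categories):
    """Encode a category with one indicator fewer, treating the first category as the all-zeros reference."""
    return one_hot(value, categories)[1:]

categories = ["A", "B", "C"]  # hypothetical category labels

one_hot_rows = [one_hot(c, categories) for c in categories]
dummy_rows = [dummy(c, categories) for c in categories]

print(one_hot_rows)  # 3 variables per value
print(dummy_rows)    # 2 variables per value
```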

Note that since your KMeans/GMM (since you alluded to EM) is going to compute distances between points, proper encoding is especially important. Understand what each encoding entails, especially when combined with the different feature normalization schemes, and try different ones to see the result.
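To see why encoding and normalization interact, here is a small sketch that compares Euclidean distances before and after min-max scaling the price. The first two rows reuse the prices from the question; the third row and the category labels are made up for illustration, and min-max scaling is just one of several possible normalization schemes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(values):
    """Rescale values to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# (category, price): rows 0 and 1 use the question's prices, row 2 is hypothetical.
rows = [("A", 12.00), ("B", 33.08), ("A", 30.00)]
one_hot = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}

# Raw price next to the one-hot columns: the price differences dominate the distance.
raw = [[price] + one_hot[cat] for cat, price in rows]
raw_d02 = euclidean(raw[0], raw[2])  # same category, prices far apart
raw_d12 = euclidean(raw[1], raw[2])  # different category, prices close

# Price rescaled to [0, 1]: the category columns now carry comparable weight.
scaled_prices = min_max_scale([price for _, price in rows])
scaled = [[p] + one_hot[cat] for p, (cat, _) in zip(scaled_prices, rows)]
scaled_d02 = euclidean(scaled[0], scaled[2])
scaled_d12 = euclidean(scaled[1], scaled[2])

print(raw_d02 > raw_d12)        # row 2 is closer to row 1 on the raw data
print(scaled_d02 < scaled_d12)  # row 2 is closer to row 0 after scaling
```

In this toy data the nearest neighbour of row 2 flips once the price is rescaled, which is the kind of effect worth checking when trying different encoding and normalization combinations.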

So I'll have to transform these values (in some way) into numeric values? Say, date to a date number, and name string to a number (using some function)? And also find the relationship between them? – JudyJiang Oct 31 '14 at 11:25
Sorry if I'm not making sense, I'm new to machine learning. Is there any source I can read? Thanks! – JudyJiang Oct 31 '14 at 11:26
