I have a fairly large data table (about 100,000 observations) that I'd like to cluster. Since some of the variables are categorical, I've tried using Gower distance and then hclust() with the "ward" method. The data is very heterogeneous, which is why I'd like to "pre-cluster" it first and then do the actual cluster analysis. Has anyone done this before and can point me in the right direction? I'm at a loss at the moment :( With the methods mentioned, I don't get useful clusters. Thanks, I really appreciate every tip I can get.

Edit: I don't think I explained my problem well, so here's another attempt. Let's say I have a dataset containing car brands and some of the cars' features. Before clustering the cars by features, I would like to pre-cluster them by brand, so that all BMWs, for example, are in the same cluster, and so on. Only after that would I like to cluster by features, so that I get, for instance, a cluster of fast cars. Does anybody know how to do this in R? This does not describe my actual dataset, but maybe my question is clearer now.

Anna

1 Answer

You should start with a sample first.

Once you get good results on the sample, try to reproduce them on a different sample. Once the results are stable, you can either try to scale the algorithm to the entire data set (maybe try doubling the sample size first), or you can train a classifier on the clustered sample and use it to predict the clusters of the remaining data. With most clustering algorithms, a 1-nearest-neighbor classifier will work very well for this.
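A minimal sketch of this sample-then-classify workflow in R might look like the following. Everything specific here is an assumption for illustration: the `cars` data frame is made-up stand-in data, `k = 4` and the sample size of 300 are arbitrary, `assign_1nn()` is a hypothetical helper, and `"ward.D2"` is used in place of the older `"ward"` name. Note also that `daisy()` rescales numeric columns by the range it sees in each call, so the chunked distances are only approximately on the same scale as those used for the sample.

```r
library(cluster)  # daisy() computes Gower dissimilarities for mixed-type data

set.seed(42)

# Hypothetical mixed-type data standing in for the real table
n <- 2000
cars <- data.frame(
  brand = factor(sample(c("BMW", "Audi", "Fiat"), n, replace = TRUE)),
  hp    = round(runif(n, 60, 400)),
  fuel  = factor(sample(c("petrol", "diesel"), n, replace = TRUE))
)

# 1. Cluster a random sample only
idx  <- sample(n, 300)
samp <- cars[idx, ]
d    <- daisy(samp, metric = "gower")
hc   <- hclust(as.dist(d), method = "ward.D2")
cl   <- cutree(hc, k = 4)  # k = 4 is an arbitrary choice for this sketch

# 2. Assign the remaining rows to the cluster of their nearest sampled row
#    (1-nearest-neighbor), working in chunks so that the full
#    n x n dissimilarity matrix is never built.
assign_1nn <- function(samp, labels, rest, chunk_size = 500) {
  out <- integer(nrow(rest))
  ns  <- nrow(samp)
  for (s in seq(1, nrow(rest), by = chunk_size)) {
    e     <- min(s + chunk_size - 1, nrow(rest))
    block <- rbind(samp, rest[s:e, , drop = FALSE])
    dm    <- as.matrix(daisy(block, metric = "gower"))
    # Rows: new observations; columns: sampled observations
    cross <- dm[(ns + 1):nrow(dm), 1:ns, drop = FALSE]
    out[s:e] <- labels[apply(cross, 1, which.min)]
  }
  out
}

rest    <- cars[-idx, ]
rest_cl <- assign_1nn(samp, cl, rest)

table(rest_cl)  # cluster sizes among the rows that were not sampled
```

To check stability as suggested above, one option is to repeat step 1 with a different seed and compare the resulting cluster assignments on the rows the two runs have in common.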

Has QUIT--Anony-Mousse