
For my thesis assignment I need to perform a cluster analysis on a high-dimensional data set containing purchase data from a retail store (1000+ dimensions). Because traditional clustering algorithms are not well suited to high dimensions (and dimension reduction is not really an option), I would like to try algorithms developed specifically for high-dimensional data (e.g. ProClus).

Here, however, my problem starts.

[Image: ProClusAlgorithm]

I have no clue what value I should use for parameter d. Can anyone help me?

Has QUIT--Anony-Mousse
JaperTIA

1 Answer


This is just one of the many limitations of ProClus.

The parameter d is the average dimensionality of your clusters, i.e. the average number of dimensions per cluster. It assumes there is a linear cluster somewhere in your data. This will likely not hold for purchase data, but you can try. For sparse data such as purchases, I would rather focus on frequent itemset mining.
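To make the frequent itemset suggestion concrete, here is a minimal sketch of a level-wise (Apriori-style) search in plain Python. The function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy baskets are all assumptions made for illustration; on a real purchase matrix you would more likely reach for a dedicated library.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=3):
    """Return {itemset (frozenset): support count} using a level-wise search.

    Hypothetical helper for illustration; `min_support` is an absolute count.
    """
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count how many baskets contain each single item.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 2
    while current and size <= max_size:
        # Join step: merge frequent (size-1)-itemsets into size-`size` candidates.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == size
        }
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        # Count support of the surviving candidates.
        current = {}
        for cand in candidates:
            support = sum(1 for basket in baskets if cand <= basket)
            if support >= min_support:
                current[cand] = support
        result.update(current)
        size += 1
    return result


# Toy data (made up): one set of purchased products per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]
itemsets = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Unlike ProClus, the only parameter to tune here is the minimum support, and the sparsity of the data works in your favor because most candidate itemsets are pruned early.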

There is no universal clustering algorithm. Any clustering algorithm will come with a variety of parameters that you need to experiment with.

For cluster analysis it is essential that you can somehow visualize or analyze the result, so that you can find out if and how well the method worked.

Has QUIT--Anony-Mousse
  • The assignment specifically asks for clustering the customers, not the products. Do you know an algorithm which can possibly handle the 1000+dimensional sparse matrix? – JaperTIA Mar 15 '16 at 13:14
  • Plenty of algorithms can *handle* it. The better question is: what is a good cluster, and how do I find it? - that is a question you need to answer. Because I don't think a ProClus cluster is a good cluster for customers. But you *can* cluster customers by the frequent itemsets which they bought. You get clusters of customers which have the same shopping behavior. (Beware, customers *may* be in multiple or none of the clusters; and that is *good*.) – Has QUIT--Anony-Mousse Mar 15 '16 at 14:17
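The idea in the last comment, grouping customers by the frequent itemsets they bought, could be sketched roughly as follows. The helper `clusters_by_itemset`, the customer IDs, and the two itemsets are hypothetical; in practice the itemsets would come from a frequent itemset miner run on the purchase data.

```python
def clusters_by_itemset(customer_baskets, itemsets):
    """Map each itemset to the set of customers whose purchases contain it.

    Hypothetical helper for illustration. Clusters may overlap, and a
    customer who matches no itemset ends up in no cluster at all.
    """
    clusters = {frozenset(s): set() for s in itemsets}
    for customer, items in customer_baskets.items():
        purchased = set(items)
        for itemset in clusters:
            if itemset <= purchased:  # customer bought every item in the itemset
                clusters[itemset].add(customer)
    return clusters


# Toy data (made up): all products each customer has ever bought.
customers = {
    "c1": {"bread", "butter", "milk"},
    "c2": {"bread", "milk"},
    "c3": {"beer", "chips"},
    "c4": {"beer", "chips", "bread", "milk"},
    "c5": {"apples"},
}
# Assumed to be the frequent itemsets found beforehand.
itemsets = [{"bread", "milk"}, {"beer", "chips"}]

clusters = clusters_by_itemset(customers, itemsets)
unassigned = set(customers) - set().union(*clusters.values())

for itemset, members in clusters.items():
    print(sorted(itemset), sorted(members))
print("unassigned:", sorted(unassigned))
```

In this toy example customer `c4` lands in both clusters and `c5` in none, which is exactly the overlapping, partial assignment described above.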