4

This is more of a theoretical question:

Do you know of any clustering algorithm (flat or hierarchical) that does not require any input parameters, such as the number of clusters or the size of the neighborhood? In other words, you simply feed your data to the algorithm as input and get clusters as output.

I would be glad to be pointed to any relevant papers or documentation.

Alina

2 Answers

2

Determining the number of clusters automatically is a really tough problem and is still considered an open research problem.

One of the most advanced clustering techniques is to model your data as a Dirichlet process mixture (see Bayesian Hierarchical Clustering), but it is not trivial and requires a solid background in Bayesian methods and estimation with Markov chain Monte Carlo (MCMC).

Such a method can estimate the number of clusters automatically.
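To give a feel for why a Dirichlet process does not fix the number of clusters in advance, here is a minimal sketch of its partition prior, the Chinese restaurant process. It is only the prior, not inference on data, and the names `crp_partition`, `expected_clusters` and `alpha` (the concentration parameter) are illustrative choices of mine, not taken from any paper or library.

```python
import random


def crp_partition(n, alpha, rng):
    """Seat n customers by the Chinese restaurant process; return table sizes."""
    sizes = []
    for i in range(n):
        # Customer i joins existing table k with probability sizes[k] / (i + alpha)
        # and opens a new table with probability alpha / (i + alpha).
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, size in enumerate(sizes):
            acc += size
            if r < acc:
                sizes[k] += 1
                break
        else:
            # r fell in the last alpha-wide slice: open a new table (cluster).
            sizes.append(1)
    return sizes


def expected_clusters(n, alpha):
    """Prior expected number of clusters: sum over i < n of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.1, 1.0, 10.0):
        sizes = crp_partition(200, alpha, rng)
        print(alpha, len(sizes), round(expected_clusters(200, alpha), 2))
```

The number of clusters grows with the data instead of being chosen up front, but `alpha` still controls how fast it grows, so the model is not entirely parameter free.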

iTech
  • This isn't quite parameter free, as you have to set the concentration parameter on the Dirichlet process, but you can make a reasonable argument that this isn't *that* important, relative to the number of clusters. You do need to worry about specifying the likelihood function though, which might be non-trivial if you're not statistically minded. – Ben Allison Feb 08 '13 at 09:57
  • You are right, it is really hard to think of an absolutely parameter-free technique; in this case the concentration parameter can be considered a *hyper-parameter*. – iTech Feb 08 '13 at 16:56
0

Usually, the answer presents itself once you define what you mean by clustering. This is the hard part.

With real-valued data, I like to use mean shift with automatic selection of the bandwidth h. The clusters correspond to the modes of the estimated data density, and the resulting grouping is similar to a watershed transform.

http://en.wikipedia.org/wiki/Mean-shift
http://en.wikipedia.org/wiki/Kernel_density_estimation
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
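As a rough illustration of the idea, here is a minimal 1-D mean shift sketch with a Gaussian kernel. The answer does not say which automatic bandwidth rule is meant, so Silverman's rule of thumb is used here purely as an assumption; it tends to oversmooth strongly multimodal data. The function names and the merge threshold of h / 2 are likewise illustrative choices.

```python
import math
import statistics


def silverman_bandwidth(xs):
    """Silverman's rule-of-thumb bandwidth for a 1-D Gaussian kernel (an assumed choice)."""
    n = len(xs)
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def mean_shift_1d(xs, h=None, tol=1e-6, max_iter=500):
    """Move every point uphill on the kernel density estimate, then group by mode."""
    if h is None:
        h = silverman_bandwidth(xs)
    modes = []
    for x in xs:
        m = x
        for _ in range(max_iter):
            # Gaussian-weighted mean of all points around the current position.
            weights = [math.exp(-0.5 * ((m - xi) / h) ** 2) for xi in xs]
            new_m = sum(w * xi for w, xi in zip(weights, xs)) / sum(weights)
            converged = abs(new_m - m) < tol
            m = new_m
            if converged:
                break
        modes.append(m)
    # Points whose trajectories end within h / 2 of each other share a cluster
    # (a heuristic threshold, not part of the mean shift iteration itself).
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if abs(m - c) < h / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
    centers, labels = mean_shift_1d(data)
    print(centers, labels)
```

No number of clusters is passed in; it falls out of how many density modes survive at the chosen bandwidth, so the bandwidth rule is where the hidden parameter lives.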

Don Reba