8

I am looking for a clustering dataset with "ground truth" labels for some known natural clustering, preferably with high dimensionality.

I found some good candidates here (http://cs.joensuu.fi/sipu/datasets/), but only the Glass and Iris datasets have labels for the points. I also found some code to generate Gaussian datasets (SynDECA). The main reason I want this is to compare distance metrics for some clustering methods. It's difficult to use external (extrinsic) evaluation criteria, as many of those are biased towards Euclidean distances, and there are so many to choose from.
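
To make the goal concrete, here is a minimal sketch of the kind of experiment described above, using only the Python standard library. It is not SynDECA: the generator, the k-medoids loop, and all names (`make_gaussian_blobs`, `k_medoids`, `rand_index`) are assumptions chosen for illustration. It draws isotropic Gaussian clusters with known labels, clusters them under two distance functions, and scores each result against the ground truth with the plain (unadjusted) Rand index. In practice a chance-corrected score such as the adjusted Rand index is usually preferable.

```python
import math
import random


def make_gaussian_blobs(n_clusters, points_per_cluster, dim, spread, rng):
    """Return (points, labels): isotropic Gaussian blobs with known cluster IDs."""
    points, labels = [], []
    for label in range(n_clusters):
        # Assumption: each cluster centre is drawn uniformly from [-10, 10]^dim.
        centre = [rng.uniform(-10.0, 10.0) for _ in range(dim)]
        for _ in range(points_per_cluster):
            points.append([rng.gauss(c, spread) for c in centre])
            labels.append(label)
    return points, labels


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, dist, rng, max_iter=20):
    """Simple alternating k-medoids; accepts any distance function."""
    medoids = rng.sample(range(len(points)), k)
    assignment = []
    for _ in range(max_iter):
        # Assign every point to its nearest medoid.
        assignment = [
            min(range(k), key=lambda j: dist(p, points[medoids[j]]))
            for p in points
        ]
        # Move each medoid to the member with the smallest total in-cluster distance.
        new_medoids = []
        for j in range(k):
            members = [i for i, a in enumerate(assignment) if a == j]
            if not members:
                new_medoids.append(medoids[j])  # keep the old medoid for an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist(points[i], points[m]) for m in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assignment


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree (1.0 = same partition)."""
    n = len(labels_a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
    return agree / (n * (n - 1) / 2)


rng = random.Random(0)
points, truth = make_gaussian_blobs(
    n_clusters=3, points_per_cluster=30, dim=20, spread=1.0, rng=rng
)

scores = {}
for name, dist in [("euclidean", euclidean), ("manhattan", manhattan)]:
    predicted = k_medoids(points, k=3, dist=dist, rng=random.Random(1))
    scores[name] = rand_index(truth, predicted)
print(scores)
```

Because the score only compares partitions, it does not depend on which distance was used to produce them, which is the property needed when the distance metric itself is the thing being compared.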

Thanks!

gagolews
user3457088

2 Answers

2

Apart from the SIPU and UCI ML repositories already mentioned, here is a list of other clustering benchmark aggregators:

gagolews
-1

There are many data sets at the UCI Machine Learning Repository.

jcrudy
  • Thanks for the reply. I've looked at this repo quickly, but can't find a dataset that has a "known" natural clustering. It has classification datasets, but data points that share a class may not be in the same cluster. What I need is a dataset that has been generated, or is otherwise known, to contain an intrinsic "correct" clustering (like the Iris or Glass sets): something like Attribute 1...Attribute n, then an additional column that says Cluster #. To be honest, I'm not sure such data can really exist, as the "correct" clustering tends to be subjective (especially for high-dimensional data). – user3457088 Mar 24 '14 at 21:00
  • I have to agree that what you want might not be a real thing. When I think of "correct clusters", in my mind that is equivalent to a classification problem. – jcrudy Mar 24 '14 at 22:16
  • Clusters != Classes. Most of the time, you will have clusters within a class, and classes may in turn cluster. Consider the iris data set: two of the iris species clearly cluster. – Has QUIT--Anony-Mousse Mar 25 '14 at 01:07
  • @Anony-Mousse However, if you have a set of clusters that is "correct", in the sense of being based on some observed characteristic not included in the set of predictors, that is not a cluster but rather a class, no? Perhaps I am not correctly understanding what user3457088 is asking for. – jcrudy Mar 25 '14 at 02:04
  • I'm not aware of any data sets where someone labeled actual clusters either. Usually, labeling is done goal-oriented (i.e. classes), not so much observational as in "these objects seem to be more closely related than others", even though others have the same function. – Has QUIT--Anony-Mousse Mar 25 '14 at 06:50
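
One way to see the clusters-versus-classes distinction discussed in these comments is to cross-tabulate class labels against cluster labels. The sketch below uses made-up toy labels (an assumption for illustration only, not real data) in which class "A" splits into two clusters while class "B" forms a single one; `contingency` is a hypothetical helper name.

```python
from collections import Counter


def contingency(classes, clusters):
    """Count how many points fall in each (class, cluster) pair."""
    return Counter(zip(classes, clusters))


# Hypothetical toy labels, invented purely for illustration (not real data).
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]
clusters = [0, 0, 1, 1, 2, 2, 2, 2]

table = contingency(classes, clusters)
for (cls, clu), count in sorted(table.items()):
    print(f"class {cls} / cluster {clu}: {count}")

# Number of distinct clusters that each class is spread over.
clusters_per_class = {
    c: len({k for (cc, k) in table if cc == c}) for c in set(classes)
}
print(clusters_per_class)
```

A class spread over several clusters, or a cluster that mixes several classes, shows up directly as more than one non-zero cell in the corresponding row or column.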