
In section 5.A of a research paper the researcher used the following synthetic datasets:

  1. GAUSS consisted of six Gaussian clusters with identity covariance, each with 500 points in five dimensions. Their means were randomly assigned a value from zero to 10 in each dimension. Cluster means were required to be at least four Euclidean distance apart, and points were required to be within two Euclidean distance of their cluster mean.
  2. PAIRED consisted of three pairs of Gaussian clusters with identity covariance, each with 500 points in five dimensions. Each pair of Gaussians was placed around a mean with a randomly assigned value in each dimension from zero to 20 such that the Euclidean distance between paired Gaussian clusters was between four and eight, and the Euclidean distance between non-paired Gaussians was at least 12. Additionally, points were required to be within two Euclidean distance of their cluster mean.

  3. ELONG consisted of five Gaussian clusters with identity covariance, each with 300 points in five dimensions. Their means were randomly assigned a value from zero to 50 in each dimension. To create elongated clusters in different dimensions, we multiplied the values of a single, distinct dimension for each cluster by 15. Cluster means were required to be at least five Euclidean distance apart.

  4. UNIFORM consisted of eight clusters, each with 300 points in three dimensions. Each cluster had its points uniformly distributed in a 3x3x3 box around a randomly assigned center in a 10x10x10 cube. Cluster centers were required to be five Euclidean distance apart.
  5. RINGS consisted of 2 ring clusters centered around (0,0), a larger outer ring with radius 2 and a smaller inner ring of radius 1. 400 points were evenly spaced by degrees on the inner ring.

http://postimg.org/image/jo4rjztjz/


I don't have these datasets. I tried to contact the researcher, but to no avail.

How can I create these datasets? Is there a tool that can generate them?

Original Paper can be found here

Has QUIT--Anony-Mousse
Ramseyl

1 Answer


Documentation and examples on the ELKI data set generator can be found here: http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator

The generator in ELKI currently cannot produce ring-shaped clusters (only spherical ones), and it does not support clipping points at a maximum distance. It samples each dimension independently; the only supported operation that involves more than one dimension at a time is rotation. Generating ring-shaped clusters, or clipping clusters by distance from a mean, requires a dependence between the values of different dimensions, which is currently not supported.
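To illustrate the kind of dependence involved: clipping by distance can be done with simple rejection sampling, where a point is drawn per dimension and then discarded if its Euclidean distance to the mean is too large. The following is only a rough sketch of how GAUSS might be generated in plain Python, based on my reading of the description in the question. The function names (`sample_means`, `sample_clipped_gaussian`, `make_gauss`) and the `seed` parameter are my own, and whether the paper's authors used rejection sampling is an assumption.

```python
import math
import random


def sample_means(k, dim, low, high, min_sep, rng):
    """Draw k means uniformly in [low, high]^dim, redrawing any candidate
    that lies closer than min_sep to an already accepted mean."""
    means = []
    while len(means) < k:
        cand = [rng.uniform(low, high) for _ in range(dim)]
        if all(math.dist(cand, m) >= min_sep for m in means):
            means.append(cand)
    return means


def sample_clipped_gaussian(mean, n, max_dist, rng):
    """Draw n points from N(mean, I), rejecting points farther than
    max_dist from the mean. The rejection step is what couples the
    dimensions to each other."""
    points = []
    while len(points) < n:
        p = [rng.gauss(mu, 1.0) for mu in mean]
        if math.dist(p, mean) <= max_dist:
            points.append(p)
    return points


def make_gauss(seed=0):
    """Six clusters, 500 points each, five dimensions (GAUSS as quoted)."""
    rng = random.Random(seed)
    means = sample_means(k=6, dim=5, low=0.0, high=10.0, min_sep=4.0, rng=rng)
    data, labels = [], []
    for label, mean in enumerate(means):
        pts = sample_clipped_gaussian(mean, n=500, max_dist=2.0, rng=rng)
        data.extend(pts)
        labels.extend([label] * len(pts))
    return means, data, labels
```

Calling `make_gauss()` returns the cluster means, a list of 3000 five-dimensional points, and their cluster labels. Note that both loops retry until the constraint is met, so they will not terminate if the constraints are infeasible (for example, too many means for the given separation).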

You will need to either contact the authors of that publication, or write a program to generate such data yourself. It's not that hard, but it may not be worth the effort: in my opinion, such synthetic data is not a realistic scenario.
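For example, RINGS takes only a few lines. This is a minimal sketch in plain Python, assuming points are placed at equal angular steps with no noise. The quoted description only gives the point count for the inner ring, so the `n_outer` default below is a placeholder of mine, not a value from the paper; the function names are likewise my own.

```python
import math


def ring(radius, n, label):
    """Return n labeled points evenly spaced by angle on a circle of the
    given radius centered at (0, 0)."""
    points = []
    for i in range(n):
        theta = 2.0 * math.pi * i / n
        points.append((radius * math.cos(theta), radius * math.sin(theta), label))
    return points


def make_rings(n_inner=400, n_outer=800):
    """Inner ring of radius 1 (label 0) plus outer ring of radius 2 (label 1).
    n_outer is an assumption; adjust it to whatever the paper specifies."""
    return ring(1.0, n_inner, 0) + ring(2.0, n_outer, 1)
```

`make_rings()` returns a list of `(x, y, label)` tuples that can be written to a CSV file and loaded into whatever clustering tool you are evaluating.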

Erich Schubert