
We want to test the performance of some fuzzy clustering algorithms that collaborators of ours have developed. Our interest lies in large 2D datasets on which we could benchmark these algorithms. Do you know where one can find such datasets?

— Open the way

3 Answers


One excellent dataset is the one provided by this very site. Stack Exchange publishes an anonymized dump of all publicly available data from its sites here: https://archive.org/details/stackexchange

You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

I have a copy of the data from a year ago; it has over 16 million records for StackOverflow.com alone, and the dump covers all of their sites.
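For clustering experiments you would still need to turn the XML dump into numeric 2D points yourself. Here is a minimal sketch in Python, assuming a local Posts.xml from the dump and using the Score and ViewCount attributes described in the schema documentation above (answer rows carry no ViewCount, so they are skipped):

    # Minimal sketch: stream Posts.xml from the dump and pull one 2D
    # point (Score, ViewCount) per question for clustering. The file
    # path is a placeholder; attribute names follow the schema above.
    import xml.etree.ElementTree as ET

    def load_2d_points(path="Posts.xml"):
        points = []
        # iterparse streams the file, so multi-GB dumps fit in memory
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "row":
                score = elem.get("Score")
                views = elem.get("ViewCount")
                # answers have no ViewCount, so skip rows without one
                if score is not None and views is not None:
                    points.append((int(score), int(views)))
            elem.clear()  # free memory for already-processed elements
        return points

    print(len(load_2d_points()), "2D points extracted")

Which pair of attributes you project down to two dimensions is of course your choice; Score and ViewCount are just one plausible pairing.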

— Michael Minella

You can generate datasets with http://www.mockaroo.com/. It is pretty good and gives you many options.
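If you need more volume than a web generator comfortably exports, and known cluster structure as ground truth, a synthetic generator is an alternative. A minimal sketch using scikit-learn's make_blobs (the sizes, cluster count, and spread below are arbitrary choices):

    # Minimal sketch: generate a large synthetic 2D dataset with known
    # cluster labels, useful as ground truth when benchmarking fuzzy
    # clustering. All parameters below are arbitrary choices.
    import numpy as np
    from sklearn.datasets import make_blobs

    X, y = make_blobs(
        n_samples=1_000_000,   # one million 2D points
        n_features=2,
        centers=5,             # five cluster centers
        cluster_std=2.0,       # wide spread so clusters overlap
        random_state=42,       # reproducible runs
    )
    np.savetxt("synthetic_2d.csv", X, delimiter=",")
    print(X.shape)  # (1000000, 2)

Because the true labels y are known, you can score the fuzzy partitions against them in addition to timing the runs.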

— Eliott Roynette

There are many large "open data" collections with scientific data around the web. Some have rather, shall we say, nontrivial dataset sizes of well over a terabyte. So, depending on the size you need, take a look at genome sites like ProteomeCommons or the datasets from Stanford's Natural Language Processing group.

Smaller dumps can be found in geology projects like this one.

— jstarek