
We want to test the performance of some fuzzy clustering algorithms that collaborators of ours have developed. Our interest lies in large 2D datasets on which we could benchmark these algorithms. Do you know where one can find such datasets?

— Open the way

3 Answers


One excellent dataset is the one provided by this very site. Stack Exchange publishes an anonymized dump of all publicly available data from its sites here: https://archive.org/details/stackexchange

You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

I have a copy of the data from a year ago; it has over 16 million records for StackOverflow.com alone, and the dump covers all of their sites.
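For clustering experiments you would still need to turn the XML dump into numeric 2D points yourself. Here is a minimal sketch in Python, assuming a local Posts.xml from the dump and using the Score and ViewCount attributes described in the schema documentation above (answer rows carry no ViewCount, so they are skipped):

    # Minimal sketch: stream Posts.xml from the dump and pull one 2D
    # point (Score, ViewCount) per question for clustering. The file
    # path is a placeholder; attribute names follow the schema above.
    import xml.etree.ElementTree as ET

    def load_2d_points(path="Posts.xml"):
        points = []
        # iterparse streams the file, so multi-GB dumps fit in memory
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "row":
                score = elem.get("Score")
                views = elem.get("ViewCount")
                # answers have no ViewCount, so skip rows without one
                if score is not None and views is not None:
                    points.append((int(score), int(views)))
            elem.clear()  # free memory for already-processed elements
        return points

    print(len(load_2d_points()), "2D points extracted")

Which pair of attributes you project down to two dimensions is of course your choice; Score and ViewCount are just one plausible pairing.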

— Michael Minella

You can generate datasets with http://www.mockaroo.com/. It is pretty good and gives you many options.
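If you need more volume than a web generator comfortably exports, and known cluster structure as ground truth, a synthetic generator is an alternative. A minimal sketch using scikit-learn's make_blobs (the sizes, cluster count, and spread below are arbitrary choices):

    # Minimal sketch: generate a large synthetic 2D dataset with known
    # cluster labels, useful as ground truth when benchmarking fuzzy
    # clustering. All parameters below are arbitrary choices.
    import numpy as np
    from sklearn.datasets import make_blobs

    X, y = make_blobs(
        n_samples=1_000_000,   # one million 2D points
        n_features=2,
        centers=5,             # five cluster centers
        cluster_std=2.0,       # wide spread so clusters overlap
        random_state=42,       # reproducible runs
    )
    np.savetxt("synthetic_2d.csv", X, delimiter=",")
    print(X.shape)  # (1000000, 2)

Because the true labels y are known, you can score the fuzzy partitions against them in addition to timing the runs.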

— Eliott Roynette

There are many large "open data" collections with scientific data around the web. Some have rather, shall we say, nontrivial dataset sizes of well over a terabyte. So, depending on the size you need, take a look at genome sites like ProteomeCommons or the datasets from Stanford's Natural Language Processing group.

Smaller dumps can be found in geology projects like this one.

— jstarek