We want to test the performance of some fuzzy clustering algorithms that our collaborators have developed. Our interest lies in large 2D datasets, with many points, on which we could benchmark these algorithms. Does anyone know where one can find such datasets?
3 Answers
One excellent dataset is the one provided by this website. StackExchange provides an anonymized dump of all publicly available data found on their sites here: https://archive.org/details/stackexchange
You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
I have a copy of the data from a year ago; it has over 16 million records for this site (StackOverflow.com) alone, and the dump covers all of their sites.
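Each per-site archive unpacks into XML files such as Posts.xml, in which every `<row>` element is one post with its attributes. As a minimal sketch of turning that into a large 2D dataset (assuming Posts.xml has been extracted into the working directory), you can stream the file and pair each post's score with its body length:

```python
# Stream Posts.xml from a Stack Exchange dump and build a 2D dataset:
# one (score, body length) point per post. iterparse keeps memory flat
# even on multi-gigabyte files.
import xml.etree.ElementTree as ET

points = []
for _, elem in ET.iterparse("Posts.xml", events=("end",)):
    if elem.tag == "row":
        score = int(elem.get("Score", 0))
        body_len = len(elem.get("Body", ""))
        points.append((score, body_len))
        elem.clear()  # discard the parsed element so memory stays bounded
```

Any other pair of numeric attributes works the same way; the point is that the dump gives you millions of rows to cluster.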

You can generate a dataset from http://www.mockaroo.com/. It is pretty good and offers many options.
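If you would rather generate the data locally instead of through the website, here is a minimal sketch using scikit-learn's make_blobs (a stand-in technique, not Mockaroo's own API) to synthesize a large 2D dataset with known cluster structure, which doubles as ground truth for the benchmark:

```python
# Synthesize a large 2D dataset with known cluster assignments.
from sklearn.datasets import make_blobs

X, labels = make_blobs(
    n_samples=1_000_000,  # scale up or down to match your benchmark
    n_features=2,         # 2D points, as the question asks for
    centers=10,           # number of ground-truth clusters
    cluster_std=1.5,      # larger values overlap clusters, stressing fuzzy methods
    random_state=0,       # reproducible runs
)
```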

There are many large "open data" collections with scientific data around the web. Some have rather, shall we say, nontrivial dataset sizes of well over a terabyte. So, depending on the size you need, take a look at genome sites like Proteomecommons or the datasets from Stanford's Natural Language Processing group.
Smaller dumps can be found in geologists' projects like this one.
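When a collection is too large to load at once, a common approach is to stream it and keep a fixed-size uniform random sample of the two columns you care about. A minimal reservoir-sampling sketch, assuming a hypothetical data.csv whose first two columns are the features:

```python
# Reservoir sampling: a uniform fixed-size sample from an arbitrarily
# large file. "data.csv" is a hypothetical example, not a dataset above.
import csv
import random

k = 100_000  # number of 2D points to keep
sample = []
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        point = (float(row[0]), float(row[1]))
        if i < k:
            sample.append(point)
        else:
            j = random.randint(0, i)  # replace with probability k/(i+1)
            if j < k:
                sample[j] = point
```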
