I want to benchmark some (graph) databases and am looking for some big, complex datasets. Each dataset should be between 2 TB and 5 TB in size. Do you know of any sample datasets (maybe open government or science data) that fulfill these criteria?
1 Answer
These should fit your requirements:
- The 1000 Genomes project makes 260 TB of human genome data available.
- The Internet Archive is making an 80 TB web crawl available for research.
- The TREC conference made the ClueWeb09 dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
- ClueWeb12 is now available, as are the Freebase annotations (FACC1).
- CNetS at Indiana University makes a 2.5 TB click dataset available.
- ICWSM made a large corpus of blog posts available for their 2011 conference. You'll have to register (with an actual paper form, not an online one), but it's free. It's about 2.1 TB compressed.
- The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size.
There are several others over 100 GB in size.

Rishi Dua