4

I am trying to become familiar with Hadoop MapReduce. After studying the theoretical side of these concepts, I want to get some hands-on practice.

However, I could not find small data sets (up to 3 GB) for this technology. Where can I find data sets to practice on?

Alternatively, how can I practice Hadoop MapReduce? In other words, are there any tutorials or websites that offer exercises?

user1743323
  • 95
  • 3
  • 8

3 Answers

7

There are many publicly accessible data sets that you can download and play around with. Below are a few examples.

http://www.netflixprize.com/index - As part of a competition, Netflix released a data set of user ratings to challenge people to develop better recommendation algorithms. The uncompressed data comes to 2 GB+. It contains 100 M+ movie ratings from 480 K users on 17 K movies.

http://aws.amazon.com/publicdatasets/ - Amazon Web Services hosts a collection of public data sets. For example, one of the biological data sets is an annotated human genome of roughly 550 GB. Under economics you can find data sets such as the 2000 U.S. Census (approximately 200 GB).

http://boston.lti.cs.cmu.edu/Data/clueweb09/ - Carnegie Mellon University's Language Technologies Institute has released the ClueWeb09 data set to aid large-scale web research. It is a crawl of one billion web pages in 10 languages; uncompressed, the data set takes up 25 TB.

saurabh shashank
  • 1,343
  • 2
  • 14
  • 22
5

Why not create some data sets yourself?

A very simple option is to fill a file with millions of random numbers and then use Hadoop to find duplicates, triples, prime numbers, numbers with repeated prime factors, and so on.
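As a local sketch of that idea (plain Python, no cluster required; the sizes and value range are made up for illustration), the "generate random numbers, then find duplicates" exercise boils down to the same grouping that a MapReduce job's shuffle phase performs for you:

```python
import random
from collections import defaultdict

def generate_numbers(n, lo=0, hi=999_999, seed=42):
    """Produce n random integers. In the real exercise you would
    write these to a text file and upload it to HDFS."""
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(n)]

def map_phase(numbers):
    """Mapper: emit a (number, 1) pair for every input record."""
    for num in numbers:
        yield num, 1

def reduce_phase(pairs):
    """Reducer: sum the counts per key and keep keys seen more than once."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return {k: v for k, v in counts.items() if v > 1}

duplicates = reduce_phase(map_phase(generate_numbers(100_000)))
```

With 100,000 draws from a million possible values, plenty of duplicates turn up. In an actual Hadoop job, `map_phase` and `reduce_phase` would become `Mapper` and `Reducer` implementations, and the framework would handle the grouping and distribution.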

Sure, it's not as fun as finding common Facebook friends, but it should suffice to get some Hadoop practice.

rolve
  • 10,083
  • 4
  • 55
  • 75
  • 1
  • That is time-consuming and not good practice for me. Working on meaningful data sets, in my opinion, will improve my problem-solving ability. – user1743323 Oct 16 '12 at 13:13
  • I disagree with both, but of course it's up to you to decide. I think simple artificial data is a better place to start as you don't have to understand and parse or pre-process it first. Also, with simple tasks and simple data you can more easily verify that your program is actually working. Good luck doing that with genomes or movie ratings. – rolve Oct 16 '12 at 15:53
3

Alternatively, how can I practice Hadoop MapReduce? In other words, are there any tutorials or websites that offer exercises?

Here are some toy problems to get you started. Also check Data-Intensive Text Processing with MapReduce; it has pseudo-code for several algorithms, such as PageRank, joins, and indexing, implemented in MapReduce.
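The classic first toy problem is word count. Here is a minimal sketch in the Hadoop Streaming style (a hypothetical stand-alone example: the mapper and reducer are written as plain functions over lines so they can be tested locally; on a cluster, streaming scripts read stdin and write tab-separated pairs to stdout):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: pairs arrive grouped by key (as after the shuffle);
    sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Streaming-style usage: cat input.txt | python wordcount.py
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```

The same two functions map directly onto a `Mapper` and `Reducer` in Hadoop's Java API, with the framework doing the sort-and-group step between them.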

Here are some public data sets collected over time. You might have to dig for the small ones.

http://wiki.gephi.org/index.php/Datasets
Download large data for Hadoop
http://datamob.org/datasets
http://konect.uni-koblenz.de/
http://snap.stanford.edu/data/
http://archive.ics.uci.edu/ml/
https://bitly.com/bundles/hmason/1
http://www.inside-r.org/howto/finding-data-internet
https://docs.google.com/document/pub?id=1CNBmPiuvcU8gKTMvTQStIbTZcO_CTLMvPxxBrs0hHCg
http://ftp3.ncdc.noaa.gov/pub/data/noaa/1990/
http://data.cityofsantacruz.com/

Community
  • 1
  • 1
Praveen Sripati
  • 32,799
  • 16
  • 80
  • 117