I need a large dataset (more than 10 GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.
7 Answers
I would suggest downloading the Million Song Dataset from the following website:
http://labrosa.ee.columbia.edu/millionsong/
The best thing about the Million Song Dataset is that you can download a 1 GB (about 10,000 songs), 10 GB, 50 GB, or roughly 300 GB dataset to your Hadoop cluster and run whatever tests you want. I love using it and have learned a lot from this dataset.
To start, you can download the subset of songs beginning with any one letter from A to Z, which ranges from 1 GB to 20 GB. You can also use the Infochimps site:
http://www.infochimps.com/collections/million-songs
In one of my blog posts I showed how to download the 1 GB dataset and run Pig scripts against it.
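Once a subset is on disk, loading it into HDFS is straightforward. A minimal sketch, assuming you have already downloaded one of the per-letter archives from the MSD site (the archive and directory names below are hypothetical; substitute whatever subset you grabbed):

    # hypothetical archive name; use whichever subset you downloaded
    tar xzf millionsong-A.tar.gz
    hdfs dfs -mkdir -p /user/demo/msd
    hdfs dfs -put A /user/demo/msd/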

It would be nice to answer in your blog the question of how you get the tsv.m file. Many people are asking about this and get stuck following the guide. Thanks for writing it, though! – ruedi Jun 11 '19 at 13:53
Tom White mentions a sample weather dataset in his book (Hadoop: The Definitive Guide).
http://hadoopbook.com/code.html
Data is available for more than 100 years. I used wget on Linux to pull the data; for the year 2007 alone the data size is 27 GB. It is hosted as an FTP link, so you can download it with any FTP utility.
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
For complete details please check my blog:
http://myjourneythroughhadoop.blogspot.in/2013/07/how-to-download-weather-data-for-your.html
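For reference, here is a minimal sketch of the pull; the wget flags are my own choice, not necessarily what the blog uses, and the HDFS target path is made up:

    # recursively fetch one year of NOAA data (2007 alone is ~27 GB)
    # -np stays below the given directory, -nH drops the hostname from local paths
    wget -r -np -nH ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2007/
    # load it into HDFS for the demo
    hdfs dfs -put pub/data/noaa/2007 /user/demo/noaa/2007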

It is OK to link to your blog, but it is better to include the actual instructions if possible. That way they can be searched, and it is easier to read than having to follow a link away from the site. It is a good resource, thanks for adding it. – Joshua Wilson Dec 03 '13 at 06:19
@Joshua Wilson: I thought it better not to repeat the same information; that is the only reason, or else I would love to add it. Thanks for the suggestion, I have updated it now. – Jagadish Talluri Dec 03 '13 at 10:51
There are public datasets available on Amazon:
http://aws.amazon.com/publicdatasets/
I would suggest considering running your demo cluster there, and thus saving yourself the download. There is also a good dataset of the crawled web from Common Crawl, which is likewise available on Amazon S3. http://commoncrawl.org/
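A minimal sketch of browsing the public bucket with the AWS CLI, assuming the crawl-data prefix Common Crawl uses on S3 (the bucket is public, so no AWS account is needed; exact keys vary per crawl, and the placeholders below are just that):

    # list the crawl archives in the public Common Crawl bucket
    aws s3 ls --no-sign-request s3://commoncrawl/crawl-data/
    # copy a single archive down for a small demo; <crawl>, <segment>, <file> are placeholders
    aws s3 cp --no-sign-request s3://commoncrawl/crawl-data/<crawl>/segments/<segment>/warc/<file>.warc.gz .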

Here is an article that might be of interest to you: "Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop".
If you are after Wikipedia page view statistics, then this might help. You can download pagecount files from 2007 up until the current date. Just to give an idea of the size of the files: 1.9 GB for a single day (here I chose 2012-05-01), spread across 24 files.
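A minimal sketch of grabbing one day of pagecount files with wget, assuming the pagecounts-raw layout on dumps.wikimedia.org as I recall it (verify the path before a large pull):

    # fetch the 24 hourly pagecount files for 2012-05-01 (~1.9 GB total)
    wget -r -np -nd -A "pagecounts-20120501-*.gz" \
        http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-05/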
Currently, 31 countries have sites that make public data available in various formats: http://www.data.gov/opendatasites. In addition, the World Bank makes data available at http://data.worldbank.org/data-catalog

What about "Internet Census 2012", data gathered by a distributed scan over the whole Internet:
Announcement: http://seclists.org/fulldisclosure/2013/Mar/166
Data: http://internetcensus2012.bitbucket.org/
The whole dataset is 7 TB, and (obviously) it is only available via torrent.

If you are interested in country indicators, the best source I found was worldbank.org. The data they offer can be exported as CSV, which makes it very easy to work with in Hadoop. If you are using .NET, I wrote a blog post (http://ryanlovessoftware.blogspot.ro/2014/02/creating-hadoop-framework-for-analysing.html) where you can see how the data looks, and if you download the code from GitHub (https://github.com/ryan-popa/Hadoop-Analysis), you already have the string parsing methods.
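If you just want the data in the cluster, a minimal sketch, assuming you have already exported an indicator as CSV from data.worldbank.org (the filename and HDFS path below are hypothetical):

    # hypothetical filename; export any indicator as CSV from data.worldbank.org first
    hdfs dfs -mkdir -p /user/demo/worldbank
    hdfs dfs -put gdp_per_capita.csv /user/demo/worldbank/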

It might be faster to generate the data than to download it and load it up. This has the advantage of giving you control of the problem domain, and it lets your demo mean something to the people who are watching.
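For example, Hadoop ships with TeraGen, which writes an arbitrary number of 100-byte rows straight into HDFS. A minimal sketch (the examples-jar path varies by distribution and version):

    # generate ~10 GB of synthetic data: 100,000,000 rows x 100 bytes each
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teragen 100000000 /user/demo/teragen-10gb

TeraGen's output is just random records, though, so it demonstrates throughput rather than a meaningful analysis.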

Yeah, but it gives no incentive to develop real and interesting algorithms to analyse the data. – Kartoch Mar 15 '13 at 20:53
This is a good idea when combined with some type of genetic algorithm or something; then you can analyze the data to look for meaning. – David Betz Jun 19 '13 at 01:34