
I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.

Russell Asher

2 Answers


In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.

Check out the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster":

http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
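
For illustration, here is a minimal sketch of reading such a data set from Azure Storage through Hadoop's FileSystem API. The asv:// scheme matches the Hadoop on Azure preview (later releases renamed it wasb://); the container and file names are hypothetical, and the storage account credentials are assumed to be configured on the cluster already (e.g. via the portal).

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AzureStorageRead {
        public static void main(String[] args) throws Exception {
            // Hypothetical container and file name. "asv://" was the Azure
            // Storage scheme in the Hadoop on Azure preview (later releases
            // use "wasb://"); the storage account and key are assumed to be
            // configured on the cluster already.
            Path input = new Path("asv://mycontainer/seattle/sample.csv");

            FileSystem fs = FileSystem.get(input.toUri(), new Configuration());
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(input)))) {
                // Print the header row to confirm the data set is reachable.
                System.out.println(reader.readLine());
            }
        }
    }

Once the data is reachable this way, MapReduce jobs can take the same asv:// path as their input directly, without first copying anything into HDFS.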

You can also get data from the Azure Marketplace, e.g. government data sets:

http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx

user728584

It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud. They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or to S3, and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has suitable query capabilities, you could instead query the data on demand from your Hadoop cluster, passing data references as input. That only works well if those queries do very serious data reduction; otherwise network bandwidth will limit performance.
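
As a rough sketch of the download approach, the snippet below pulls a CSV export over HTTP and streams it straight into the cluster's default file system via Hadoop's FileSystem API; an s3:// destination would work the same way if that file system is configured. The export URL, data set id, and paths are hypothetical placeholders, not the actual data.seattle.gov endpoint — check their API documentation for the real one.

    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class DownloadToHdfs {
        public static void main(String[] args) throws Exception {
            // Hypothetical CSV export URL; substitute the real endpoint
            // for the data set you want from data.seattle.gov.
            URL source = new URL("https://data.seattle.gov/api/views/<dataset-id>/rows.csv");

            // Destination inside the cluster; an s3://... path would work
            // the same way if the S3 file system is configured.
            Path dest = new Path("/data/seattle/rows.csv");

            FileSystem fs = FileSystem.get(new Configuration());
            try (InputStream in = source.openStream();
                 FSDataOutputStream out = fs.create(dest)) {
                // Stream the download straight into the cluster, 4 KB at a time.
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }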

David Gruzman