Hadoop Basics: What do I do with the output?

Question

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)

I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.

Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.

This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way...

Enlighten me.

"Thrift? Protobuf? Hive ODBC API Driver? There must be a better way..." Could you be more specific about what you don't like about these? Or about copying it from the HDFS? They all seem like good options for lots of different cases. — Tim Yates, May 17 '11 at 16:51
Sure. There are lots of different options, and I want to know what the most common way of dealing with output is. Does anyone know what Yahoo uses? Other web analytics companies? There's a lot of noise in this space right now, and we want to make sure that there will be significant support for our decision. — batman, May 17 '11 at 16:56
That's not to say we don't like the options. They each have compelling use cases, especially Thrift and Protobuf. But is this the *right* way to access Hadoop output? (Highly subjective, but worth considering) — batman, May 17 '11 at 16:57
That's understandable. In that case, this sounds like a Community Wiki question. Like you said, it's highly subjective: there are lots of possible answers, and you're more interested in exploring what different people are using and for what reasons. — Tim Yates, May 17 '11 at 17:00

score 3 · Answer 1 · answered Jun 07 '11 at 20:43

At foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.

I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using Thrift directly is very easy, and because Thrift has bindings for many languages, you can write the client in whatever language you prefer.
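As a rough illustration of the "run a query, then move the output" step, here is a minimal Python sketch. It is not the actual job server code. It assumes your Hive client library exposes a DB-API 2.0 style connection; `export_query_to_csv`, `demo`, and the `daily_totals` table are hypothetical names, and `sqlite3` stands in for Hive only so the sketch runs without a cluster.

```python
import csv
import io
import sqlite3


def export_query_to_csv(connection, query, out_file, batch_size=1000):
    """Run `query` on a DB-API 2.0 connection and write the result as CSV.

    Returns the number of data rows written (the header is not counted).
    """
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        writer = csv.writer(out_file, lineterminator="\n")
        # cursor.description has one entry per column; item 0 is the column name.
        writer.writerow([column[0] for column in cursor.description])
        row_count = 0
        while True:
            # Fetch in batches so a large result set is never fully in memory.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            row_count += len(batch)
        return row_count
    finally:
        cursor.close()


def demo():
    # Assumption: sqlite3 is a stand-in for a Hive connection here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_totals (day TEXT, visits INTEGER)")
    conn.executemany(
        "INSERT INTO daily_totals VALUES (?, ?)",
        [("2011-06-06", 120), ("2011-06-07", 95)],
    )
    buffer = io.StringIO()
    count = export_query_to_csv(
        conn, "SELECT day, visits FROM daily_totals ORDER BY day", buffer
    )
    conn.close()
    return count, buffer.getvalue()
```

In a real job you would pass an open file (or an upload stream) instead of the in-memory buffer, and swap the connection for whatever your Hive driver provides.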

If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.

Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
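For the batch case, a nightly job could shell out to `sqoop export` to push a Hive output directory into a relational table. The sketch below only assembles the command. The JDBC URL, table name, user, and HDFS path are placeholders, the helper names are hypothetical, and the `\001` delimiter assumes the table uses Hive's default field separator. Check the flags against the Sqoop version you have installed.

```python
import subprocess


def build_sqoop_export_command(jdbc_url, table, export_dir, username,
                               field_delimiter="\\001"):
    """Build the argument list for a `sqoop export` run.

    `field_delimiter` defaults to the literal text \\001, which Sqoop reads
    as the Ctrl-A character (Hive's default field separator).
    """
    return [
        "sqoop", "export",
        "--connect", jdbc_url,          # JDBC URL of the target database
        "--username", username,
        "-P",                           # prompt for the password on the console
        "--table", table,               # target table, which must already exist
        "--export-dir", export_dir,     # HDFS directory holding the output files
        "--input-fields-terminated-by", field_delimiter,
    ]


def run_sqoop_export(command):
    # Raises CalledProcessError if sqoop exits with a non-zero status.
    subprocess.run(command, check=True)
```

For example, `build_sqoop_export_command("jdbc:mysql://db.example.com/reports", "daily_totals", "/user/hive/warehouse/daily_totals", "etl")` returns a list you can hand to `run_sqoop_export` on a machine where the `sqoop` client is on the PATH.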

Hope that helps.

Great answer. Thanks. I haven't seen much other than Facebook slideshows in the way of real-life workflows. — batman, Jun 17 '11 at 17:22
