
I've run Hive on Elastic MapReduce in interactive mode:

./elastic-mapreduce --create --hive-interactive

and in script mode:

./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q

I'd like an application (preferably in PHP, R, or Python) on my own server to be able to spin up an Elastic MapReduce cluster and run several Hive commands, getting their output in a parsable form.

I know that spinning up a cluster can take some time, so my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like the following hypothetical example:
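For the cluster part, here's the kind of thing I had in mind (a sketch driving the same elastic-mapreduce CLI shown above from Python via subprocess; I'm assuming that tool's --alive and --jobflow flags behave as I describe, and the jobflow ID parsing is hypothetical):

```python
import subprocess

EMR_CLI = './elastic-mapreduce'  # path to the EMR command-line tool used above

def create_cluster_cmd():
    # --alive (as I understand it) keeps the cluster running after each
    # step finishes, so several Hive steps can reuse the same jobflow
    return [EMR_CLI, '--create', '--alive', '--hive-interactive']

def add_hive_step_cmd(jobflow_id, script_s3_path):
    # submit an additional Hive script as a step on the existing jobflow
    return [EMR_CLI, '--jobflow', jobflow_id,
            '--hive-script', '--arg', script_s3_path]

def run(cmd):
    # run a command and return its stdout as text
    return subprocess.check_output(cmd).decode()

# hypothetical usage -- the CLI prints something like "Created job flow j-XXXX":
# jobflow_id = run(create_cluster_cmd()).split()[-1]
# run(add_hive_step_cmd(jobflow_id, 's3://mybucket/myfile.q'))
```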

  1. create Hive table customer_orders
  2. run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
  3. wait for result
  4. parse result in PHP
  5. run Hive query "SELECT MAX(id) FROM customer_orders"
  6. wait for result
  7. parse result in PHP ...
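For instance, the parsing steps might look something like this in Python (a sketch assuming I've downloaded Hive's default tab-delimited query output to a local file; the filename and function name are mine):

```python
import csv

def parse_hive_output(path):
    """Parse Hive's default tab-delimited output into rows of strings."""
    with open(path) as f:
        return [row for row in csv.reader(f, delimiter='\t')]

# hypothetical usage with the GROUP BY query above:
# rows = parse_hive_output('daily_counts.tsv')
# each row would then be [dt, count], e.g. ['2012-12-01', '42']
```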

Does anyone have any recommendations on how I might do this?

dubois

1 Answer


You could use mrjob. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.

An alternative is HiPy, a project that may well cover all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.

HiPy lets you group query construction, transform scripts, and post-processing in a single script. This aids traceability, documentation, and reusability: everything appears in one place, and Python comments can be used to document the script.

Hive queries are constructed by composing a handful of Python objects, representing things such as Columns, Tables and Select statements. During this process, HiPy keeps track of the schema of the resulting query output.

Transform scripts can be included in the main body of the Python script. HiPy will take care of providing the code of the script to Hive as well as of serialization and de-serialization of data to/from Python data types. If any of the data columns contain JSON, HiPy takes care of converting that to/from Python data types too.

Check out the Documentation for details!

Amar
  • Thank you! I had not heard of HiPy; I will check that out. I recently heard of mrjob, but that doesn't use Hive as far as I know. I also came across the AWS SDK for PHP, which it seems like I can use: http://aws.amazon.com/sdkforphp/ – dubois Dec 21 '12 at 21:52