
I have a large dataset residing in AWS S3. This data is typically transactional data (like call records). I run a sequence of Hive queries that continuously apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most). So far with Hive, I had to manually run one query after another (as sometimes some queries fail due to problems in AWS, etc.).
I have processed 2 months of data so far using this manual approach.

But for subsequent months, I want to be able to write a workflow which will execute the queries one by one, and should a query fail, rerun it. This CAN'T be done by just running hive queries from a bash .sh file (my current approach, at least):

hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql   # this might need Table A to be populated before executing
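
(For context, the closest I could get in plain bash is a retry wrapper like the sketch below, where MAX_RETRIES is just a placeholder I picked; but even that gives me no dependency graph, scheduling, or monitoring.)

#!/bin/bash
# Sketch: run each Hive script in order, retrying a failed one up to MAX_RETRIES times.
MAX_RETRIES=3

run_with_retry() {
    local script="$1"
    local attempt=1
    until hive -f "$script"; do
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            echo "Giving up on $script after $attempt attempts" >&2
            exit 1
        fi
        attempt=$((attempt + 1))
        echo "Retrying $script (attempt $attempt)..." >&2
    done
}

# Table B depends on Table A, so order matters.
run_with_retry s3://mybucket/createAndPopulateTableA.sql
run_with_retry s3://mybucket/createAndPopulateTableB.sql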

Alternatively, I have been looking at Cascading, wondering whether it might be the solution to my problem; it does have Lingual, which might fit the case. I'm not sure, though, how it fits into the AWS ecosystem.

The optimal solution would be some kind of Hive query workflow process. Otherwise, what other options do I have in the Hadoop ecosystem?

Edit: I am looking at Oozie now, though I'm facing a sh!tload of issues setting it up on EMR. :(

prog_guy

1 Answer


You can use AWS Data Pipeline:

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available

You can configure it to retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
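
As a rough sketch of what the pipeline definition could look like (field names follow the HiveActivity docs linked above; the cluster settings and retry counts here are placeholders, and a real definition also needs a schedule or Default object), chaining your two scripts with retries might be:

{
  "objects": [
    {
      "id": "MyCluster",
      "name": "MyCluster",
      "type": "EmrCluster",
      "terminateAfter": "2 Hours"
    },
    {
      "id": "PopulateTableA",
      "name": "PopulateTableA",
      "type": "HiveActivity",
      "runsOn": { "ref": "MyCluster" },
      "scriptUri": "s3://mybucket/createAndPopulateTableA.sql",
      "stage": "false",
      "maximumRetries": "3"
    },
    {
      "id": "PopulateTableB",
      "name": "PopulateTableB",
      "type": "HiveActivity",
      "runsOn": { "ref": "MyCluster" },
      "scriptUri": "s3://mybucket/createAndPopulateTableB.sql",
      "stage": "false",
      "maximumRetries": "3",
      "dependsOn": { "ref": "PopulateTableA" }
    }
  ]
}

The dependsOn reference makes the service run Table B's script only after Table A's succeeds, and maximumRetries handles the transient AWS failures you mentioned.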

SelimN
  • Hi, that was also one of the options I'd considered. However, the Data Pipeline service is currently not supported in my region. – prog_guy Jun 20 '14 at 08:24