
I am trying to batch process some data in a Kinesis stream using a Pig script on AWS EMR. I just need to group the stream data and move it to S3, running the job every couple of hours. At first this seemed like a great fit for AWS Data Pipeline, but I can't figure out how to pass in an iteration number to use for Kinesis checkpointing. There doesn't appear to be any way to increment a number and pass it through to the Pig script. I've seen the example here, which involves an always-on cluster and a crontab script that increments the iteration number. Is there a way to achieve this using AWS Data Pipeline that I'm missing?
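For reference, the always-on-cluster approach mentioned above can be sketched roughly as follows. This is an illustrative assumption of how such a cron job might work, not code from the linked example; the counter file path, parameter name (`iterationNumber`), and script name (`process_stream.pig`) are all hypothetical.

```python
# Hedged sketch of the crontab approach: keep a counter in a local file,
# increment it on each run, and pass it to the Pig script as a parameter.
import subprocess

COUNTER_FILE = "/home/hadoop/iteration.txt"  # hypothetical path

def next_iteration(path: str = COUNTER_FILE) -> int:
    """Read the last iteration number (defaulting to 0), increment it,
    persist the new value, and return it."""
    try:
        with open(path) as f:
            last = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        last = 0
    current = last + 1
    with open(path, "w") as f:
        f.write(str(current))
    return current

if __name__ == "__main__":
    n = next_iteration()
    # Pass the iteration number through to the Pig script for checkpointing.
    subprocess.run(["pig", "-p", f"iterationNumber={n}", "process_stream.pig"])
```

The drawback, as noted, is that this requires a long-running cluster to hold the counter state between runs.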

Greg Neiheisel

1 Answer


We do have an example of using Data Pipeline to accomplish what you want, but it uses Hive instead of Pig. It just might be enough to give you an idea and set you on the right path.

https://github.com/awslabs/data-pipeline-samples/tree/master/samples/kinesis

If this example does not answer your question, please let us know so we can look into creating another example that addresses your use case.
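One stateless alternative worth considering (this is a sketch of a workaround, not the approach taken in the linked sample): AWS Data Pipeline can interpolate the expression `#{@scheduledStartTime}` into an activity's arguments, so rather than incrementing a stored counter, the iteration number can be derived deterministically from the run's scheduled start time. The parameter wiring and interval below are assumptions for illustration.

```python
# Hedged sketch: map each scheduled run to a monotonically increasing
# iteration number, with no state carried between pipeline activations.
from datetime import datetime, timezone

INTERVAL_SECONDS = 2 * 60 * 60  # assumes the pipeline runs every two hours

def iteration_number(scheduled_start: str) -> int:
    """Map an ISO-8601 scheduled start time (e.g. '2015-06-01T04:00:00',
    the format Data Pipeline uses for @scheduledStartTime) to an integer
    iteration number: the count of whole intervals since the Unix epoch."""
    dt = datetime.strptime(scheduled_start, "%Y-%m-%dT%H:%M:%S")
    epoch = int(dt.replace(tzinfo=timezone.utc).timestamp())
    return epoch // INTERVAL_SECONDS
```

Because consecutive two-hour runs map to consecutive integers, the number can be computed inside the activity from the interpolated start time and handed to the Pig script as a parameter, with no always-on cluster or crontab needed.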

Austin Lee