I have a Pig job that analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.
Some notes:
- This job is running on Amazon Elastic MapReduce.
- I can use STREAM to pipe the data through an external command and load its output back from there (see the sketch after this list). But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives and can't batch them. Obviously, this hurts performance.
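For concreteness, the STREAM approach would look something like this. The script name, URL, and aliases are hypothetical, just to make the limitation concrete: `poster.py` receives tab-separated rows on stdin and, with no EOF ever arriving, has to fire one POST per line instead of buffering a batch.

```
-- Hypothetical sketch: summary holds the rows to send. Since poster.py
-- never sees EOF, it must POST each tab-separated line as it arrives
-- rather than flushing a batch at end of input.
DEFINE post_cmd `poster.py http://collector.example.com/ingest` SHIP('poster.py');
posted = STREAM summary THROUGH post_cmd;
```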
What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter (a custom StoreFunc, roughly sketched below)? Thank you for your advice!
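Here's a rough sketch of what I imagine such a storage adapter could look like, assuming the Pig 0.7+ StoreFunc API. `HttpJsonStorage`, `HttpJsonOutputFormat`, the batch size, and the URL are all hypothetical names of mine, not anything from PiggyBank. The key point is that Hadoop calls the RecordWriter's `close()` once the task finishes, so the final partial batch can be flushed there; that's the "EOF" signal that STREAM never delivers.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class HttpJsonStorage extends StoreFunc {
    private final String url;
    private RecordWriter<Object, Tuple> writer;

    public HttpJsonStorage(String url) { this.url = url; }

    @Override
    public OutputFormat getOutputFormat() { return new HttpJsonOutputFormat(url); }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // We POST instead of writing files, but Hadoop still wants an output
        // path for job bookkeeping, so pass the INTO location through.
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) { this.writer = writer; }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            writer.write(null, t);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

class HttpJsonOutputFormat extends FileOutputFormat<Object, Tuple> {
    private static final int BATCH_SIZE = 500; // tune to taste
    private final String url;

    HttpJsonOutputFormat(String url) { this.url = url; }

    @Override
    public RecordWriter<Object, Tuple> getRecordWriter(TaskAttemptContext ctx)
            throws IOException {
        return new RecordWriter<Object, Tuple>() {
            private final List<String> batch = new ArrayList<String>();

            @Override
            public void write(Object key, Tuple t) throws IOException {
                batch.add(toJson(t));
                if (batch.size() >= BATCH_SIZE) flush();
            }

            @Override
            public void close(TaskAttemptContext c) throws IOException {
                // Called at end of task: the "EOF" STREAM never gets.
                if (!batch.isEmpty()) flush();
            }

            // POST the buffered rows as one JSON array.
            private void flush() throws IOException {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setDoOutput(true);
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "application/json");
                StringBuilder body = new StringBuilder("[");
                for (int i = 0; i < batch.size(); i++) {
                    if (i > 0) body.append(',');
                    body.append(batch.get(i));
                }
                OutputStream os = conn.getOutputStream();
                os.write(body.append(']').toString().getBytes("UTF-8"));
                os.close();
                if (conn.getResponseCode() >= 300)
                    throw new IOException("POST failed: HTTP " + conn.getResponseCode());
                batch.clear();
            }

            // Naive encoding: each tuple becomes a JSON array of strings. A real
            // implementation should use a JSON library and escape values properly.
            private String toJson(Tuple t) throws IOException {
                StringBuilder sb = new StringBuilder("[");
                for (int i = 0; i < t.size(); i++) {
                    if (i > 0) sb.append(',');
                    sb.append('"').append(String.valueOf(t.get(i))).append('"');
                }
                return sb.append(']').toString();
            }
        };
    }
}
```

If that's the right direction, I'd register the jar and call it with something like `STORE summary INTO '/tmp/unused' USING HttpJsonStorage('http://collector.example.com/ingest');`, where the INTO path is only there to satisfy Hadoop's bookkeeping. Is this the sane approach, or is there an existing storer that already does this?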