More specifically, is there a somewhat easy streaming solution?
See this link: "How do I process files, one per map?" In outline:
- Upload your data to an S3 bucket
- Generate a file containing the full s3n:// path to each file
- Write a mapper script that:
  - Pulls 'mapred_work_output_dir' out of the environment (*)
  - Performs XSLT transform based on the name of the file, saving to the output directory
- Write an identity reducer that does nothing
- Upload your mapper / reducer scripts to an S3 bucket
- Test your script via the AWS EMR console
(*) Streaming puts your jobconf into the process's environment. See code here.
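To make the mapper step concrete, here is a rough sketch of what such a script might look like in Python. It is only an illustration under several assumptions that are not from the answer above: that `xsltproc` and the `hadoop` CLI are available on the task nodes, that a stylesheet named `transform.xsl` (a hypothetical name) is shipped with the job, and that each input line carries one s3n:// path. The helper names are made up for this example.

```python
#!/usr/bin/env python
"""Streaming mapper sketch: each input line names one file to transform."""
import os
import posixpath
import subprocess
import sys

STYLESHEET = "transform.xsl"  # hypothetical name; assumed to ship with the job


def input_uri_from_line(line):
    # Keep only the last tab-separated field, in case a key precedes the path.
    return line.rstrip("\n").split("\t")[-1].strip()


def output_uri(out_dir, input_uri):
    # Name the result after the input file: .../in/a.xml -> <out_dir>/a.xml
    return out_dir.rstrip("/") + "/" + posixpath.basename(input_uri)


def build_commands(input_uri, out_dir, local_in="in.xml", local_out="out.xml"):
    # Fetch the file, transform it locally, then copy the result to the output dir.
    return [
        ["hadoop", "fs", "-get", input_uri, local_in],
        ["xsltproc", "-o", local_out, STYLESHEET, local_in],
        ["hadoop", "fs", "-put", local_out, output_uri(out_dir, input_uri)],
    ]


def main(stdin=sys.stdin, environ=os.environ):
    for line in stdin:
        uri = input_uri_from_line(line)
        if not uri:
            continue
        # Streaming exports jobconf keys with '.' replaced by '_'.
        out_dir = environ["mapred_work_output_dir"]
        for cmd in build_commands(uri, out_dir):
            subprocess.check_call(cmd)
        # Remove the scratch files so the next -get does not hit an existing file.
        for path in ("in.xml", "out.xml"):
            if os.path.exists(path):
                os.remove(path)


if __name__ == "__main__":
    main()
```

Since the mapper writes its results directly to the output directory and emits nothing on stdout, the reducer has no records to handle, which is why an identity reducer is enough.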
