
More specifically, is there a somewhat easy streaming solution?

Dimitre Novatchev
zack

1 Answer


See the Hadoop Streaming FAQ entry "How do I process files, one per map?" Adapted to EMR, the steps are:

  • Upload your data to an S3 bucket
  • Generate a file containing the full s3n:// path to each file
  • Write a mapper script that:
    • Pulls 'mapred_work_output_dir' out of the environment (*)
    • Runs the XSLT transform on each file named in its input, saving the result to that output directory
  • Write an identity reducer that does nothing
  • Upload your mapper / reducer scripts to an S3 bucket
  • Test your script via the AWS EMR console
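
The manifest from the second step could be generated along these lines. This is only a sketch: the bucket name and object keys are made-up placeholders, and `manifest_lines` and `write_manifest` are hypothetical helper names, not part of any library.

```python
def manifest_lines(bucket, keys, scheme="s3n"):
    """Return one fully qualified URI per object key."""
    # A leading slash on a key would produce "bucket//key", so strip it.
    return ["%s://%s/%s" % (scheme, bucket, key.lstrip("/")) for key in keys]


def write_manifest(path, bucket, keys):
    """Write the URIs one per line; this file becomes the streaming job's input."""
    with open(path, "w") as handle:
        for uri in manifest_lines(bucket, keys):
            handle.write(uri + "\n")


# Example with placeholder names:
# write_manifest("manifest.txt", "my-bucket", ["in/a.xml", "in/b.xml"])
```

Each line of the manifest then arrives on the mapper's standard input as one record.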

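(*) Streaming puts your jobconf in the process's environment, with the dots in property names replaced by underscores, so mapred.work.output.dir becomes mapred_work_output_dir. See code here.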
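
A mapper along these lines might look roughly like the sketch below. It assumes that `xsltproc` and the `hadoop` command are available on the task nodes, that a stylesheet is shipped with the job under the hypothetical name `transform.xsl`, and that each input line is one path from the manifest. The helper names are illustrative only.

```python
#!/usr/bin/env python
import os
import subprocess
import sys

# Hypothetical stylesheet name; ship the real one alongside the mapper.
STYLESHEET = "transform.xsl"


def output_dir(env):
    """Return the task's work output directory from the streaming environment."""
    return env["mapred_work_output_dir"]


def output_name(input_path):
    """Derive an output file name from the input path, e.g. a.xml -> a.out.xml."""
    stem, ext = os.path.splitext(os.path.basename(input_path.rstrip("/")))
    return stem + ".out" + (ext or ".xml")


def build_commands(input_path, out_dir, stylesheet=STYLESHEET):
    """List the commands that fetch, transform, and store one file."""
    local_in = os.path.basename(input_path)
    local_out = output_name(input_path)
    return [
        ["hadoop", "fs", "-get", input_path, local_in],
        ["xsltproc", "-o", local_out, stylesheet, local_in],
        ["hadoop", "fs", "-put", local_out, out_dir + "/" + local_out],
    ]


def main(lines, env):
    out_dir = output_dir(env)
    for line in lines:
        # Tolerate an optional "key<TAB>path" layout by taking the last field.
        path = line.strip().split("\t")[-1]
        if not path:
            continue
        for cmd in build_commands(path, out_dir):
            # Keep child output off stdout, which carries the mapper's records.
            subprocess.check_call(cmd, stdout=sys.stderr)
        print(path + "\tdone")


# Run only when launched by streaming, which sets this variable.
if __name__ == "__main__" and "mapred_work_output_dir" in os.environ:
    main(sys.stdin, os.environ)
```

Shelling out keeps the script free of third-party Python packages; an in-process XSLT library would also work if it is installed on the nodes.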

Richard Padley
Ryan Cox