
The Hadoop documentation states that it's possible to make files available locally to tasks using the -file option.

How can I do this using the Elastic MapReduce Ruby CLI?

Matt Joiner
  • Can you be more specific on what you're trying to do? Locally to what do you need to make the files available? – Charles Menguy Jan 17 '13 at 02:44
  • @CharlesMenguy: Locally to the map/reduce tasks. Hadoop lets you take those files from the location where you invoke Hadoop and makes them available to the map/reduce tasks automatically. – Matt Joiner Jan 17 '13 at 23:37

1 Answer


You could use the DistributedCache with EMR to do this.

With the Ruby client, this can be done with the following option:

`--cache <path_to_file_being_cached#name_in_current_working_dir>`

This places a single file in the DistributedCache. You specify the location (s3n or hdfs) of the file, followed by the name it will have in the application's current working directory, and the file is placed locally on your task nodes in the directory identified by mapred.local.dir (I think).

You can then access the file easily in your Mapper/Reducer tasks. I believe you can access it directly like any normal file, but you may have to do something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks (see the sketch after the CLI example below).

An example of doing this with the Ruby client, taken from Amazon's forums:

```
./elastic-mapreduce --create --stream \
  --input s3n://your_bucket/wordcount/input \
  --output s3n://your_bucket/wordcount/output \
  --mapper s3n://your_bucket/wordcount/wordSplitter.py \
  --reducer aggregate \
  --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list
```
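
On the Java side, here's a rough sketch of reading a cached file in a mapper's setup() method, assuming the stop-word list from the command above was passed with --cache (the StopWordMapper class name and the word-count logic are just illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: counts words, skipping any word found in the cached stop-word list.
public class StopWordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Set<String> stopWords = new HashSet<String>();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files passed with --cache end up on each task node's local disk;
        // getLocalCacheFiles returns their local paths.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached != null) {
            for (Path p : cached) {
                loadStopWords(p.toString());
            }
        }
    }

    private void loadStopWords(String localPath) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(localPath));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                stopWords.add(line.trim().toLowerCase());
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty() && !stopWords.contains(word.toLowerCase())) {
                context.write(new Text(word), one);
            }
        }
    }
}
```

Because the --cache argument includes the #stop-word-list fragment, the file should also be reachable as a plain file named stop-word-list in the task's working directory.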
Charles Menguy