The Hadoop documentation states that files can be made available locally by using the `-file` option.
How can I do this using the Elastic MapReduce Ruby CLI?
You could use the `DistributedCache` with EMR to do this.
With the ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the `DistributedCache`. You give the location of the file (an s3n or hdfs URI), then `#`, then the name by which the file will be referenced in the application's current working directory. The file is copied to local disk on your task nodes, in the directory identified by `mapred.local.dir` (I think).
You can then access the files easily in your `Mapper`/`Reducer` tasks. I believe you can open a cached file directly, just like any normal file, but you may have to call something like `DistributedCache.getLocalCacheFiles(job);` in the `setup` method of your tasks.
An example of doing this with the Ruby client, taken from Amazon's forums:
./elastic-mapreduce --create --stream --input s3n://your_bucket/wordcount/input --output s3n://your_bucket/wordcount/output --mapper s3n://your_bucket/wordcount/wordSplitter.py --reducer aggregate --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list
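For a streaming job like the one above, the cached file should show up in the task's current working directory under the name given after the `#`, so the mapper can open it by that name. Below is a rough sketch of what a mapper using the stop-word file might look like. It is not Amazon's actual `wordSplitter.py`. The function names, the one-word-per-line file format, and the word regex are my own assumptions. The `LongValueSum:` prefix is what the built-in `aggregate` reducer expects for summing counts.

```python
#!/usr/bin/env python
"""Streaming mapper sketch: skip stop words loaded from a DistributedCache file."""
import re
import sys

# Name given after '#' in the --cache option. Hadoop streaming links the cached
# file under this name in the task's current working directory.
CACHE_LINK_NAME = "stop-word-list"

# Assumption: a "word" is a run of ASCII letters.
WORD_RE = re.compile(r"[a-zA-Z]+")


def load_stop_words(path=CACHE_LINK_NAME):
    """Return the lowercased stop words, assuming one word per line."""
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def map_lines(lines, stop_words):
    """Yield one 'LongValueSum:<word>\\t1' record per word that is not a stop word."""
    for line in lines:
        for word in WORD_RE.findall(line):
            word = word.lower()
            if word not in stop_words:
                yield "LongValueSum:%s\t1" % word


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read input lines from stdin and write mapper records to stdout."""
    stop_words = load_stop_words()
    for record in map_lines(stdin, stop_words):
        stdout.write(record + "\n")
```

To run this as the `--mapper` script, end the file with a call to `main()` (usually behind an `if __name__ == "__main__":` guard). If the `--cache` option is missing, `open` fails and the task errors out, which is probably what you want, since it makes the misconfiguration visible.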