Amazon MapReduce input splitting and downloading

Question

I'm new to EMR and just had a few questions i have been struggling with the past few days. The first of which is the logs that i want to process are already compressed as .gz and i was wondering if these types of files are able to be split by emr so that more then one mapper will work on a file. Also i have been reading that input files will not be split unless they are 5gb, my files are not that large so does that mean they will only be processed by one instance?

My other question might seem relatively dumb but is it possible to use emr+streaming and have an input someplace other then s3? It seems redundant to have to download the logs from the CDN, then upload them to my s3 bucket to run mapreduce on them. Right now i have them downloading onto my server then my server is uploading them to s3, is there a way to cut out the middle man and have it go straight to s3, or run the inputs off my server?

score 3 · Accepted Answer · answered Dec 30 '11 at 18:07

are already compressed as .gz and i was wondering if these types of files are able to be split by emr so that more then one mapper will work on a file

Alas, no, straight gzip files are not splittable. One option is to just roll your log files more frequently; this very simple solution works for some people though it's a bit clumsy.

Also i have been reading that input files will not be split unless they are 5gb,

This is definitely not the case. If a file is splittable you have lots of options on how you want to split it, eg configuring mapred.max.split.size. I found [1] to be a good description of the options available.

is it possible to use emr+streaming and have an input someplace other then s3?

Yes. Elastic MapReduce now supports VPC so you could connect directly to your CDN [2]

[1] http://www.scribd.com/doc/23046928/Hadoop-Performance-Tuning

[2] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/EnvironmentConfig_VPC.html?r=146

Amazon MapReduce input splitting and downloading

1 Answers1