I'm new to EMR and have a few questions I have been struggling with for the past few days. First, the logs I want to process are already compressed as .gz files. Can EMR split these files so that more than one mapper works on a single file? I have also been reading that input files will not be split unless they are 5 GB. My files are not that large, so does that mean they will only be processed by one instance?
My other question might seem relatively dumb, but is it possible to use EMR + streaming and have the input somewhere other than S3? It seems redundant to download the logs from the CDN and then upload them to my S3 bucket just to run MapReduce on them. Right now my server downloads them and then uploads them to S3. Is there a way to cut out the middleman and have the logs go straight to S3, or to run the job with inputs read from my server?