I would like to process lots of data in S3 efficiently with mrjob (running on EMR). I can structure the data any way I would like, and clearly I want to do everything I can to play to the strengths of having EMR run on S3 data.
My data consists of millions of web pages (each 50K, let's say). Intuitively, it makes sense to me to create a set of .tar.gz files (.tgz for short) that each have thousands of pages, such that the .tgz file sizes are around 2GB or so. I would like to then load these .tgz files onto S3 and write a mrjob task to process these (on, say, 10 EC2 instances).
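For concreteness, here is a rough sketch of how I was planning to build and upload those archives. This is just to illustrate the setup I have in mind; the bucket name, local paths, and pages-per-archive count are placeholders I would tune so each .tgz lands around 2GB.

```python
import os
import tarfile
import boto3

PAGES_DIR = "/data/pages"        # local directory of ~50K web pages (placeholder)
BUCKET = "my-crawl-bucket"       # hypothetical S3 bucket
PAGES_PER_ARCHIVE = 40000        # tune so each .tgz comes out around 2GB

s3 = boto3.client("s3")
pages = sorted(os.listdir(PAGES_DIR))

# Bundle the pages into a series of .tgz archives and upload each one to S3.
for i in range(0, len(pages), PAGES_PER_ARCHIVE):
    archive_name = "pages-%05d.tgz" % (i // PAGES_PER_ARCHIVE)
    with tarfile.open(archive_name, "w:gz") as tar:
        for page in pages[i:i + PAGES_PER_ARCHIVE]:
            tar.add(os.path.join(PAGES_DIR, page), arcname=page)
    s3.upload_file(archive_name, BUCKET, "input/" + archive_name)
```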
I am attracted to building these .tgz files because they represent a very compressed form of the data, which should minimize network traffic (less data to transfer, and therefore less transfer latency). I am also attracted to building multiple .tgz files because I would obviously like to take advantage of the multiple EMR instances I plan to allocate for the job.
If I have to, I could concatenate the files so that I avoid the archive (tar) step and just deal with .gz files, but it would be easier to simply tar the original data and then compress it.
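In case it helps, this is roughly what I mean by the concatenation fallback. The record separator is just something I made up so a later job could recover page boundaries.

```python
import gzip
import os

PAGES_DIR = "/data/pages"                 # placeholder local directory
SEPARATOR = "\n===PAGE-BREAK===\n"        # hypothetical page delimiter

# Concatenate the raw pages, delimiter-separated, into a single .gz file
# (one of many such files in practice).
with gzip.open("pages-00000.gz", "wt", encoding="utf-8") as out:
    for page in sorted(os.listdir(PAGES_DIR)):
        with open(os.path.join(PAGES_DIR, page), encoding="utf-8") as f:
            out.write(f.read())
        out.write(SEPARATOR)
```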
Am I thinking about this the right way, and if so, how can I configure/specify mrjob to decompress and un-tar the archives so that an instance processes the entirety of exactly one of those .tgz files?
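To show what I'm after, here is a rough sketch of the kind of job I want to write, assuming each mapper somehow receives a reference to exactly one .tgz archive (a local path, in this hypothetical). Wiring up that input handling is exactly the part I don't know how to do.

```python
import tarfile
from mrjob.job import MRJob


class ProcessPages(MRJob):

    def mapper(self, _, tgz_path):
        # Hypothetical: the input value is a local path to one .tgz archive
        # that this mapper is solely responsible for.
        with tarfile.open(tgz_path, "r:gz") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                page = tar.extractfile(member).read()
                # Placeholder per-page work: emit the page name and its size.
                yield member.name, len(page)

    def reducer(self, page_name, sizes):
        yield page_name, sum(sizes)


if __name__ == "__main__":
    ProcessPages.run()
```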