I would like to process lots of data in S3 efficiently with mrjob (running on EMR). I can structure the data any way I would like, and clearly I want to do everything I can to play to the strengths of having EMR run on S3 data.
My data consists of millions of web pages (each 50K, let's say). Intuitively, it makes sense to me to create a set of .tar.gz files (.tgz for short) that each have thousands of pages, such that the .tgz file sizes are around 2GB or so. I would like to then load these .tgz files onto S3 and write a mrjob task to process these (on, say, 10 EC2 instances).
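For concreteness, here is a rough sketch of how I was planning to build and upload those archives. This is just to illustrate the setup I have in mind; the bucket name, local paths, and pages-per-archive count are placeholders I would tune so each .tgz lands around 2GB.

```python
import os
import tarfile
import boto3

PAGES_DIR = "/data/pages"        # local directory of ~50K web pages (placeholder)
BUCKET = "my-crawl-bucket"       # hypothetical S3 bucket
PAGES_PER_ARCHIVE = 40000        # tune so each .tgz comes out around 2GB

s3 = boto3.client("s3")
pages = sorted(os.listdir(PAGES_DIR))

# Bundle the pages into a series of .tgz archives and upload each one to S3.
for i in range(0, len(pages), PAGES_PER_ARCHIVE):
    archive_name = "pages-%05d.tgz" % (i // PAGES_PER_ARCHIVE)
    with tarfile.open(archive_name, "w:gz") as tar:
        for page in pages[i:i + PAGES_PER_ARCHIVE]:
            tar.add(os.path.join(PAGES_DIR, page), arcname=page)
    s3.upload_file(archive_name, BUCKET, "input/" + archive_name)
```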
I am attracted to building these .tgz files because they represent a very compressed form of the data, which should minimize network traffic (less data to transfer, and therefore less transfer latency). I am also attracted to building multiple .tgz files because I would obviously like to take advantage of the multiple EMR instances I plan to allocate for the job.
If I have to, I could concatenate the files so that I avoid the archive (tar) step and just deal with .gz files, but it would be easier to simply tar the original data and then compress it.
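In case it helps, this is roughly what I mean by the concatenation fallback. The record separator is just something I made up so a later job could recover page boundaries.

```python
import gzip
import os

PAGES_DIR = "/data/pages"                 # placeholder local directory
SEPARATOR = "\n===PAGE-BREAK===\n"        # hypothetical page delimiter

# Concatenate the raw pages, delimiter-separated, into a single .gz file
# (one of many such files in practice).
with gzip.open("pages-00000.gz", "wt", encoding="utf-8") as out:
    for page in sorted(os.listdir(PAGES_DIR)):
        with open(os.path.join(PAGES_DIR, page), encoding="utf-8") as f:
            out.write(f.read())
        out.write(SEPARATOR)
```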
Am I thinking about this the right way, and if so, how can I configure/specify mrjob to decompress and un-tar the archives so that an instance processes the entirety of exactly one of those .tgz files?
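To show what I'm after, here is a rough sketch of the kind of job I want to write, assuming each mapper somehow receives a reference to exactly one .tgz archive (a local path, in this hypothetical). Wiring up that input handling is exactly the part I don't know how to do.

```python
import tarfile
from mrjob.job import MRJob


class ProcessPages(MRJob):

    def mapper(self, _, tgz_path):
        # Hypothetical: the input value is a local path to one .tgz archive
        # that this mapper is solely responsible for.
        with tarfile.open(tgz_path, "r:gz") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                page = tar.extractfile(member).read()
                # Placeholder per-page work: emit the page name and its size.
                yield member.name, len(page)

    def reducer(self, page_name, sizes):
        yield page_name, sum(sizes)


if __name__ == "__main__":
    ProcessPages.run()
```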