I'm missing something obvious about Yelp's mrjob library. Setting up an MRJob class is almost trivially easy, and so is running it over a file or stdin. But how can I change the job's input from a file (local or in S3) to, say, the keys in an S3 bucket?
Something like this. Suppose I wanted to count all objects in my S3 bucket whose names start with the string 'foo':
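import re
from mrjob.job import MRJob

class MRCountS3Objects(MRJob):
    # Wished-for behavior: each input value is a boto S3 key, not a line of text
    def mapper(self, _, botoS3Key):
        if re.match('^foo', botoS3Key.name):
            yield 'foo', 1

    def reducer(self, name, occurrences):
        yield name, sum(occurrences)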
It's a highly contrived example, but you probably get my drift. How can I tell MRJob to operate over a stream of S3 objects, ignoring the content of the objects? I saw the S3Filesystem.get_s3_keys() method, which gets me exactly the stream I need, but I'm not sure where to go from there.
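The closest workaround I can think of is to dump the key names into a manifest file, one per line, and use that file as the job's input, so that each key name arrives at the mapper as an ordinary line of text. Below is a minimal sketch of that idea, not a confirmed mrjob recipe. The key names are hypothetical placeholders for a real bucket listing, and plain functions stand in for the MRJob mapper and reducer methods so the logic can be checked without mrjob or Hadoop.

```python
import os
import re
import tempfile


def write_key_manifest(key_names, path):
    """Write one key name per line so each name becomes one input record."""
    with open(path, "w") as f:
        for name in key_names:
            f.write(name + "\n")


def mapper(_, key_name):
    # Same test as in the question, applied to a key-name string
    # instead of a boto key object.
    if re.match(r"foo", key_name):
        yield "foo", 1


def reducer(name, occurrences):
    yield name, sum(occurrences)


def run_locally(manifest_path):
    """Simulate the map and reduce phases over the manifest file."""
    grouped = {}
    with open(manifest_path) as f:
        for line in f:
            for key, value in mapper(None, line.rstrip("\n")):
                grouped.setdefault(key, []).append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


# Hypothetical key names; in practice these would come from a bucket listing.
key_names = ["foo/a.txt", "bar/b.txt", "foobar.csv"]
manifest_path = os.path.join(tempfile.mkdtemp(), "keys.txt")
write_key_manifest(key_names, manifest_path)
result = run_locally(manifest_path)
print(result)
```

This still feels like a detour, since it needs a separate listing step before the job starts, which is why I'm asking whether mrjob can consume the key stream directly.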