I have a MapReduce project I am working on (specifically I am using Python and the library MrJob and plan on running using Amazon's EMR). Here is an example to sum up the issue I am having:

I have thousands of GB of json files full of customer data. I need to go and run daily, weekly, and monthly reports on each customer json line/input/object.

So for the Map step I currently do:

def map_step(self, _, customer_json_object):
    # customer_json_object is assumed to be a dict decoded from one JSON line.
    # Emit the record once per report window that its timestamp falls into.
    c_uuid = customer_json_object["uuid"]
    if customer_json_object["time"] in daily_time_range:
        yield "%s-%s" % (DAILY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object["time"] in weekly_time_range:
        yield "%s-%s" % (WEEKLY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object["time"] in monthly_time_range:
        yield "%s-%s" % (MONTHLY_CONSTANT, c_uuid), customer_json_object

And then for the reducer

def reducer_step(self, key, customer_info):
    # Split on the first hyphen only, since UUIDs usually contain hyphens too.
    report_type, c_uuid = key.split("-", 1)
    yield None, Create_Report(report_type, customer_info)

My question is:

Am I guaranteed here that all my data with the same key (meaning here all data for a specific customer and specific report type) will be handled by the same reducer? My Create_Report cannot be spread across multiple processes and therefore I need all the data required for a report to be handled by one process.

I am worried that if there are too many values for a key, they might be spread out among several reducers. However, from what I have read, it sounds like all values for a key do go to the same reducer.

Thank you so much!! I just realized I needed to yield multiple times from the map step so this is the last piece of the puzzle for me. If this can be figured out it will be a huge win as I cannot scale my little server any farther vertically...

If it is not clear from the code above: I have thousands of files of JSON lines of customer data (or really user data, no one is paying me anything). I want to create reports from this data, and the report is generated differently depending on whether it is monthly, weekly, or daily. I am also deduplicating the data before this, but this is my last step: actually producing the output. I really appreciate you taking the time to read this and help!!

Brad Barrows
  • Haven't read the whole thing, but going by the subject line: use a partitioner to guarantee that. – SMA Feb 17 '15 at 06:53

1 Answer

In MapReduce in general, and in the Python library MrJob in particular, the following holds:

A reducer takes a key and the complete set of values for that key in the current step, and returns zero or more arbitrary (key, value) pairs as output.

from: MrJob Documentation - https://pythonhosted.org/mrjob/guides/concepts.html#mapreduce-and-apache-hadoop

So back to your question:

Am I guaranteed here that all my data with the same key ... will be handled by the same reducer?

Yes. In addition, all the values belonging to the same key are passed to the same call of your reducer.
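To make the guarantee concrete, here is a minimal plain-Python sketch of the map, shuffle, and reduce phases. It is not MrJob code and does not use Hadoop: the `shuffle` function is a stand-in for what the framework does between the two steps, and the sample records, the `day` field, and the `daily`/`weekly` constants are made up for illustration.

```python
from collections import defaultdict

# Hypothetical report-type constants and sample records, for illustration only.
DAILY, WEEKLY = "daily", "weekly"
records = [
    {"uuid": "u1", "day": 1},
    {"uuid": "u2", "day": 1},
    {"uuid": "u1", "day": 5},
]

def mapper(record):
    # Every record counts toward the weekly report; only day 1 counts as "today".
    if record["day"] == 1:
        yield "%s-%s" % (DAILY, record["uuid"]), record
    yield "%s-%s" % (WEEKLY, record["uuid"]), record

def shuffle(pairs):
    # Stand-in for the framework's shuffle/sort: group every value under its key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Called once per key, with the complete list of values for that key.
    report_type, c_uuid = key.split("-", 1)
    yield key, {"type": report_type, "customer": c_uuid, "records": len(values)}

def run(records):
    pairs = [pair for record in records for pair in mapper(record)]
    reports = {}
    for key, values in shuffle(pairs).items():
        for out_key, report in reducer(key, values):
            reports[out_key] = report
    return reports

reports = run(records)
```

Because the grouping happens before any reducer runs, the reducer for `weekly-u1` sees both of that customer's records in a single call, no matter which mapper emitted them.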

  • Thank you so much! I saw this statement everywhere once I started rereading up on MapReduce, but I had a hard time believing it, because missing that one simple fact had kept me from getting my project to work for so long. Thank you so much! – Brad Barrows Feb 17 '15 at 17:02
  • So what you are adding here with "the same call of your reducer" is that not only is it the same reducer, but it is the same method call that gets all values for a key, right? So then each reducer doesn't need to be dedicated to one specific report type for a customer (which is my key); it can create all the different report types for a customer? – Brad Barrows Feb 17 '15 at 23:19
  • A single reducer can handle multiple keys (and hence reduce calls) to avoid the overhead of spawning too many reducers, but by default the distribution of keys to reducers is by the hash of the key so there is little meaningful to glean about the list of keys that a reducer is handling. – Jeremy Beard Feb 18 '15 at 02:22
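The hash-based distribution described in the last comment can be sketched roughly as follows. This is an illustration only: CRC32 stands in for whatever hash the framework really applies to the key, and `NUM_REDUCERS` and the sample keys are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical cluster setting

def partition(key, num_reducers=NUM_REDUCERS):
    # CRC32 is a stand-in for the framework's key hash; unlike Python's built-in
    # hash() on strings, it gives the same result in every process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["%s-%s" % (report, uuid)
        for report in ("daily", "weekly", "monthly")
        for uuid in ("u1", "u2", "u3", "u4")]

# Record which keys each reducer would be responsible for.
assignment = defaultdict(list)
for key in keys:
    assignment[partition(key)].append(key)

for reducer_id in sorted(assignment):
    print(reducer_id, assignment[reducer_id])
```

A given key always maps to the same reducer index, which is what keeps all of its values together. With more keys than reducers, each reducer ends up handling several unrelated keys, each in its own reduce call.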