I have a MapReduce project I am working on (specifically, I am using Python with the mrjob library and plan to run it on Amazon's EMR). Here is an example that sums up the issue I am having:
I have thousands of GB of JSON files full of customer data. I need to run daily, weekly, and monthly reports on each customer JSON line/input/object.
So for the Map step I currently do:
def map_step(self, _, customer_json_object):
    c_uuid = customer_json_object.uuid
    if customer_json_object.time in daily_time_range:
        yield "%s-%s" % (DAILY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object.time in weekly_time_range:
        yield "%s-%s" % (WEEKLY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object.time in monthly_time_range:
        yield "%s-%s" % (MONTHLY_CONSTANT, c_uuid), customer_json_object
And then for the reduce step I do:
def reducer_step(self, key, customer_info):
    # split only on the first "-" so hyphens inside the uuid survive
    report_type, c_uuid = key.split("-", 1)
    yield None, Create_Report(report_type, customer_info)
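To sanity-check the flow end to end, here is a minimal pure-Python simulation of the two steps above (no mrjob; the records and time ranges are made up, and a record count stands in for Create_Report):

```python
from itertools import groupby

DAILY_CONSTANT, WEEKLY_CONSTANT, MONTHLY_CONSTANT = "daily", "weekly", "monthly"

# Made-up time ranges for the sketch; the real job derives these from dates
daily_time_range = range(0, 100)
weekly_time_range = range(0, 700)
monthly_time_range = range(0, 3000)

def map_step(record):
    c_uuid = record["uuid"]
    if record["time"] in daily_time_range:
        yield "%s-%s" % (DAILY_CONSTANT, c_uuid), record
    if record["time"] in weekly_time_range:
        yield "%s-%s" % (WEEKLY_CONSTANT, c_uuid), record
    if record["time"] in monthly_time_range:
        yield "%s-%s" % (MONTHLY_CONSTANT, c_uuid), record

def reduce_step(key, records):
    report_type, c_uuid = key.split("-", 1)  # split once: uuids contain "-"
    return (report_type, c_uuid, len(list(records)))  # stand-in for Create_Report

records = [
    {"uuid": "aaa-111", "time": 50},
    {"uuid": "aaa-111", "time": 600},
    {"uuid": "bbb-222", "time": 10},
]

# Simulate the shuffle: sort every (key, value) pair by key, then group by key,
# so each reduce_step call sees ALL values for exactly one key
pairs = sorted((kv for r in records for kv in map_step(r)), key=lambda kv: kv[0])
reports = [reduce_step(k, (v for _, v in grp))
           for k, grp in groupby(pairs, key=lambda kv: kv[0])]
```

With these three input records, the simulation produces six reports (one per report-type/customer key), which matches what I expect the real job to do.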
My question is:
Am I guaranteed that all my data with the same key (meaning all data for a specific customer and a specific report type) will be handled by the same reducer? My Create_Report cannot be spread across multiple processes, so I need all the data required for one report to be handled by a single process.
I am worried that if there are too many values for a single key, they might get split across multiple reducers. From what I have read, though, it sounds like all values for a key always go to one reducer.
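If it helps anyone answer, my current (possibly wrong) mental model is that the partitioner routes by key alone, something like Hadoop's default hash(key) % num_reducers, so the number of values under a key should never matter. A toy sketch of that idea:

```python
import hashlib

def partition(key, num_reducers):
    # Deterministic stand-in for Hadoop's default HashPartitioner:
    # the target partition depends only on the key, never on the values
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % num_reducers

# 100,000 values emitted under one key still route to a single partition
emitted = [("monthly-aaa-111", i) for i in range(100000)]
targets = {partition(key, 8) for key, _ in emitted}
```

Here targets ends up containing exactly one partition index, no matter how many values were emitted under the key.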
Thank you so much!! I just realized I needed to yield multiple times from the map step, so this is the last piece of the puzzle for me. If this works out it will be a huge win, as I cannot scale my little server any further vertically...
If it is not clear from the code above: I have thousands of files of JSON lines of customer (really user, no one is paying me anything) data. I want to create reports for this data, and the report-generation code differs depending on whether the report is monthly, weekly, or daily. I am also de-duplicating the data before this, but this is my last step, actually producing the output. I really appreciate you taking the time to read this and help!!