I am not sure I understand your question.
What you ultimately want is SE(U). After some math details on slides 23 and 24, it is "trivially" computed as \sum_{i} SE(U)_i.
You have already understood that the 4th and last step is a map reduce to get this sum.
The 3rd step is a map reduce to get (LaTeX style)
SE(U)_i = \sum_{u in U_i} (r_{u,i} - r_i)^2

- The reduce function sums over u in U_i
- The map function splits the terms to be summed
In Python this might look like:
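def map(item, Ui, r_i):
    ''' Ui maps each user who has rated the film i to the rating r_{u,i}; r_i is the mean rating of i '''
    results = []
    for user, r_ui in Ui.items():
        # One term of the sum per user: (r_{u,i} - r_i)^2
        results.append((user, (r_ui - r_i) ** 2))
    return results

def reduce(item, results):
    ''' Returns a final pair (item, SE(U)_i) '''
    return (item, sum(value for user, value in results))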
Edit: My original answer was incomplete. Let me explain again.
What you ultimately want is SE(U) for every splitter.
Step a prepares some useful data about items. The emitted entries are defined with:
key = (population_id, item)
value =
    number: |U_i|,
    sum_of_ratings: \sum_{u \in U_i} r_{u,i},
    sum_of_squared_ratings: \sum_{u \in U_i} r_{u,i}^2
- The map function explodes the statistics over the items.
- The reduce function computes the sums.
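A minimal Python sketch of step a, assuming the input is a list of (population_id, user, item, rating) tuples. The record layout and the names `map_rating`, `reduce_stats` and `run_step_a` are hypothetical and not taken from the slides; the dictionary grouping only stands in for the shuffle phase of a real Map/Reduce framework.

```python
from collections import defaultdict

def map_rating(record):
    """Emit one partial statistic per rating: (count, rating, rating squared)."""
    population_id, user, item, rating = record
    yield (population_id, item), (1, rating, rating ** 2)

def reduce_stats(key, values):
    """Sum the partial statistics component-wise for one (population_id, item) key."""
    number = sum_of_ratings = sum_of_squared_ratings = 0
    for count, rating, rating_squared in values:
        number += count
        sum_of_ratings += rating
        sum_of_squared_ratings += rating_squared
    return key, (number, sum_of_ratings, sum_of_squared_ratings)

def run_step_a(records):
    """Group the mapper output by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_rating(record):
            groups[key].append(value)
    return dict(reduce_stats(key, values) for key, values in groups.items())

# Toy data (made up): population 0, three users, two films.
records = [
    (0, "alice", "film_a", 5),
    (0, "bob", "film_a", 3),
    (0, "carol", "film_a", 4),
    (0, "alice", "film_b", 2),
]
step_a = run_step_a(records)
```

Each value is the triple (number, sum_of_ratings, sum_of_squared_ratings) for one item, which is all the later steps need from the full population.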
Now, for any given splitter movie M:
U_M = U_{+M} + U_{-M} + U_{M?}
Step b explicitly computes, for each splitter M, the statistics for the small sub-populations M+ and M-.
NB: likers/dislikers is not a boolean per se; it is the sub-population indicator '+' or '-'.
There are 2 new entries for each splitter item:
key = (population_id, item, M, '+')
value =
    number: |U_i(+)|,
    sum_of_ratings: \sum_{u \in U_i(+)} r_{u,i},
    sum_of_squared_ratings: \sum_{u \in U_i(+)} r_{u,i}^2
The same holds for '-'.
Or, if you prefer the dis/likers notation:
key = (population_id, item, M, dis/likers)
value =
    number: |U_i(dis/likers)|,
    sum_of_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i},
    sum_of_squared_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i}^2
cf. the middle of slide 24.
NB: If you consider that each film might be a splitter, there are 2 x |item|^2 entries of the second form, because each item expands to (boolean, item, splitter). That is far fewer than your 2^|item| estimate, which you haven't explained.
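Step b could be sketched in Python as follows. This is only an illustration under assumptions that are not in the slides: ratings are stored as a dict per user, a user counts as a liker of M when the rating of M is at least `LIKE_THRESHOLD`, and the names `map_user` and `run_step_b` are hypothetical.

```python
from collections import defaultdict

LIKE_THRESHOLD = 4  # assumption: a rating of 4 or more counts as "like"

def map_user(population_id, user_ratings, splitters):
    """For each splitter M the user has rated, emit the user's ratings under '+' or '-'."""
    for M in splitters:
        if M not in user_ratings:
            continue  # the user is in the unknown group M? and emits nothing here
        side = "+" if user_ratings[M] >= LIKE_THRESHOLD else "-"
        for item, rating in user_ratings.items():
            yield (population_id, item, M, side), (1, rating, rating ** 2)

def run_step_b(population_id, ratings_by_user, splitters):
    """Sum (number, sum_of_ratings, sum_of_squared_ratings) per emitted key."""
    totals = defaultdict(lambda: (0, 0, 0))
    for user_ratings in ratings_by_user.values():
        for key, value in map_user(population_id, user_ratings, splitters):
            totals[key] = tuple(a + b for a, b in zip(totals[key], value))
    return dict(totals)

# Toy data (made up): carol has not rated film_a, so she is "unknown" for that splitter.
ratings_by_user = {
    "alice": {"film_a": 5, "film_b": 2},
    "bob": {"film_a": 1, "film_b": 4},
    "carol": {"film_b": 3},
}
step_b = run_step_b(0, ratings_by_user, splitters=["film_a"])
```

With one splitter and two items this produces at most 2 x 2 entries, which matches the 2 x |item|^2 count above once every film is allowed to be a splitter.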
Step c computes, for each splitter M, the SE contributed by each movie, i.e. SE(U_M)_i.
Because a sum can be split across its different members:
U_M = U_{+M} + U_{-M} + U_{M?}
SE(U_M)_i = SE(U_{M?})_i + SE(U_{+M})_i + SE(U_{-M})_i
with SE(U_{+M})_i (and likewise SE(U_{-M})_i)
explicitly computed with this map function:
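def map(key, value):
    '''
    key = (population_id, item, M, dis/likers)
    value = (r_ui, r_i), assumed here to hold one rating r_{u,i} and the mean rating r_i
    '''
    r_ui, r_i = value
    # One record per rating: a count of 1 and the squared distance to the mean
    yield key, (1, (r_ui - r_i) ** 2)

def reduce(key, values):
    '''
    This function explicitly computes the SE for dis/likers
    key = (population_id, item, M, dis/likers)
    values = iterable of (count, dist)
    '''
    values = list(values)
    yield key, (sum(count for count, dist in values), sum(dist for count, dist in values))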
Now all we need is SE(U_{M?})_i,
which is a "trivial" computation given in slide 24:
SE(?)_i = \sum_{u \in U_i(?)} r_{u,i}^2 - (\sum_{u \in U_i(?)} r_{u,i})^2 / |U_i(?)|
Of course, we are not going to compute these big sums directly. Instead we use the remark just above in the lecture and the data already computed in step a (that is the conclusion I draw from the last 3 equations of slide 24):
SE(?)_i = \sum_{u \in U_i} r_{u,i}^2 - \sum_{u \in U_i('+'/'-')} r_{u,i}^2 - (...) / (|U_i| - |U_i('+'/'-')|)
So this one is not even a Map/Reduce; it is just a finalize step:
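def finalize(key, stats, stats_likers, stats_dislikers, se_likers, se_dislikers):
    '''
    key = (population_id, item, M). The other arguments are the entries matching key:
    # from step a
    key = (population_id, item) value = stats = (nb_ratings, sum_ratings, sum_ratings_squared)
    # from step b
    key = (population_id, item, M, '+') value = stats_likers = (nb_ratings_likers, sum_ratings_likers, sum_ratings_squared_likers)
    key = (population_id, item, M, '-') value = stats_dislikers = (nb_ratings_dislikers, sum_ratings_dislikers, sum_ratings_squared_dislikers)
    # from step c
    key = (population_id, item, M, '+') value = se_likers
    key = (population_id, item, M, '-') value = se_dislikers
    '''
    population_id, item, splitter = key
    nb_ratings, sum_ratings, sum_ratings_squared = stats
    nb_ratings_likers, sum_ratings_likers, sum_ratings_squared_likers = stats_likers
    nb_ratings_dislikers, sum_ratings_dislikers, sum_ratings_squared_dislikers = stats_dislikers
    # Statistics of the '?' sub-population, obtained by subtraction only
    nb_other = nb_ratings - nb_ratings_likers - nb_ratings_dislikers
    sum_other = sum_ratings - sum_ratings_likers - sum_ratings_dislikers
    sum_squared_other = sum_ratings_squared - sum_ratings_squared_likers - sum_ratings_squared_dislikers
    # SE(?)_i = sum of squares - (sum)^2 / count; an empty sub-population contributes 0
    se_other = sum_squared_other - sum_other ** 2 / nb_other if nb_other else 0.0
    # emit key: (population_id, splitter, item), value: SE(U_M)_i
    return (population_id, splitter, item), se_likers + se_dislikers + se_other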
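As a sanity check on the subtraction used for se_other, the following sketch compares the SE of the unknown group computed directly with the value obtained from aggregate statistics only. The ratings and the helper names `squared_error` and `stats` are made up for illustration.

```python
import math

def squared_error(ratings):
    """Direct definition: sum of squared deviations from the mean."""
    mean = sum(ratings) / len(ratings)
    return sum((r - mean) ** 2 for r in ratings)

def stats(ratings):
    """(number, sum_of_ratings, sum_of_squared_ratings), as in steps a and b."""
    return len(ratings), sum(ratings), sum(r * r for r in ratings)

# Hypothetical ratings of one item i, split by opinion on a splitter M.
likers, dislikers, unknown = [5, 4, 4], [2, 1], [3, 5, 1, 3]

n, s, s2 = stats(likers + dislikers + unknown)  # step a: whole population
n_l, s_l, s2_l = stats(likers)                  # step b: '+'
n_d, s_d, s2_d = stats(dislikers)               # step b: '-'

# Statistics of the unknown group, obtained by subtraction only.
n_o, s_o, s2_o = n - n_l - n_d, s - s_l - s_d, s2 - s2_l - s2_d
se_other = s2_o - s_o ** 2 / n_o

# Both routes agree, so the ratings of the unknown group never need to be scanned.
assert math.isclose(se_other, squared_error(unknown))
```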
Step d: Finally, the last step computes the SE for U_M. It is simply the sum of the previous entries, and a trivial Map/Reduce.
For a splitter M:
SE(U_M) = \sum_i SE(U_M)_i = \sum_i SE(U_{M?})_i + \sum_i SE(U_{+M})_i + \sum_i SE(U_{-M})_i
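This last Map/Reduce might look like the sketch below, assuming the finalize step produced entries keyed by (population_id, splitter, item). The names `map_entry`, `reduce_splitter` and `run_step_d`, as well as the numbers, are hypothetical.

```python
from collections import defaultdict

def map_entry(key, se_i):
    """Drop the item from the key so that all items of one splitter meet in one reducer."""
    population_id, splitter, item = key
    yield (population_id, splitter), se_i

def reduce_splitter(key, values):
    """SE(U_M) is the sum of the per-item errors SE(U_M)_i."""
    return key, sum(values)

def run_step_d(entries):
    """Group the mapper output by (population_id, splitter), then reduce each group."""
    groups = defaultdict(list)
    for key, se_i in entries.items():
        for new_key, value in map_entry(key, se_i):
            groups[new_key].append(value)
    return dict(reduce_splitter(key, values) for key, values in groups.items())

# Made-up output of the finalize step: (population_id, splitter, item) -> SE(U_M)_i
entries = {
    (0, "film_a", "film_a"): 1.5,
    (0, "film_a", "film_b"): 2.0,
    (0, "film_b", "film_a"): 4.0,
    (0, "film_b", "film_b"): 0.5,
}
se_by_splitter = run_step_d(entries)
```

The result holds one SE(U_M) per (population_id, splitter) pair, which is the quantity the whole pipeline was built to produce.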