
This question was originally a homework assignment I had, but my answer was wrong, and I'm curious what is the best solution for this problem.

The goal is to compute key aspects of the "Recommender System bootstrapping algorithm" using 4 map reduce steps. My problem is with the 3rd step, so I'll bring only its details.


input: records of the form:
1. (population id, item, number of rating users, sum of ratings, sum of ratings squared)
2. (population id, splitter item, likers/dislikers, item, number of rating users, sum of ratings, sum of ratings squared)

The 2nd form is pretty much like the 1st form, but a record for each (splitter,likers/dislikers) - where likers/dislikers is a boolean.

This means (I think) there are 2^|items| records of the second form for each record of the first form (many classmates assumed, wrongly I think, that there are equally many records of each form).

Task description:

This step will compute, per splitter movie, the squared error (SE) induced by each movie.

  • Output: records of the form (population id, splitter item, item, squared error on item given a split on the splitter).

Hint:

assume that there exists a string that precedes (in the system’s sort order) any splitter movie id.

This must be done within one mapreduce step!

additional background:
This was learned in the context of "The Netflix Challenge".

SE definition: [image: SE definition]

EDIT: additional material concerning the problem [some description on the netflix challenge and mathematical information about the problem ] can be found in this link [slides 12-24 especially]

EDIT2: note that since we are using map/reduce, we cannot assume anything about the ORDER records will be processed [in both map and reduce].

Gilles 'SO- stop being evil'
amit
  • In your text item = movie ? What's splitter items ? do you have examples of record ? – Ricky Bobby Aug 12 '11 at 14:38
  • item = movie. splitter item is a movie we split our users according to the answers we have. it is explained in more details in the attached link. – amit Aug 12 '11 at 14:40
  • Can you point to the original algorithm? I found some slides, but it would be best if you could point to it. Btw, typo in "Recommander" - I can't edit, though. – Iterator Aug 12 '11 at 14:44
  • @Iterator: as said in the slides, it is more a series of computations than an algorithm. The last computation is on slide 24. My problem is computing these steps using map/reduce, and as I said, I failed to find the right computation for the 3rd step. – amit Aug 13 '11 at 08:22
  • p.s. thank you for the spelling correction, and thanks to @Ashelly for fixing it. – amit Aug 13 '11 at 08:22
  • Are you sure likers/dislikers is a _boolean_ and not the list of users who like/disliked this film? – rds Aug 16 '11 at 21:37
  • What is the meaning of population id? Can this distinguish U+ and U- for a splitter? – rds Aug 16 '11 at 21:38
  • And since you don't explain what was the previous step and what is the next, can you refine on the definition of "sum of ratings" (sum over... * or U+ only, hence my previous question about the meaning of population) "sum of ratings squared" (is it "sum of (rating^2)" or "(sum of ratings)^2". I think I came up with something better, but I'm afraid it might be the previous step. – rds Aug 16 '11 at 22:43
  • likers/dislikers is indeed a boolean value. population id, is a unique identifier to this certain population part. the previous steps and the original task can be found [here](http://webcourse.cs.technion.ac.il/236621/Winter2010-2011/hw/WCFiles/hw4-2010-11.pdf) [question 4]. @rds: thank you for the time you are spending on it! – amit Aug 17 '11 at 05:49

1 Answer


I am not sure I understand your question.

What you ultimately want is SE(U). After some math details at slides 23 and 24, it is "trivially" computed with \sum_{i} SE(U)_i

You have understood by yourself that the 4th and last step is a map reduce to get this sum.

The 3rd step is a map reduce to get (LaTeX style)

SE(U)_i = \sum_{u \in U_i} (r_{u,i} - r_i)^2


  • The reduce function sums over u in U_i
  • The map function splits the terms to be summed

In Python this might look like:

def map(Ui):
    ''' Ui is the list of users who have rated the film i '''
    results = []
    for user in Ui:
        results.append((user, (r_{u,i} - r_i)^2))
    return results

def reduce(item, results):
    ''' Returns a final pair (item, SE(U)_i) '''
    return (item, sum(value for user, value in results))
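For anyone who wants something that actually runs, here is a minimal sketch of the same idea in plain Python. It assumes that r_i is the mean rating of item i (the SE definition image is not reproduced here, so that is my reading), and the names `map_ratings`, `reduce_item` and `run` are made up for the illustration. The mapper keys by item rather than by user so that grouping hands the reducer one item at a time.

```python
from collections import defaultdict

def map_ratings(item, ratings):
    """Emit one (item, squared deviation) pair per user who rated `item`.

    `ratings` maps user -> r_{u,i}. Assumption: r_i is the item's mean rating.
    """
    r_i = sum(ratings.values()) / len(ratings)
    return [(item, (r_ui - r_i) ** 2) for r_ui in ratings.values()]

def reduce_item(item, values):
    """Sum the squared deviations to get the pair (item, SE(U)_i)."""
    return (item, sum(values))

def run(all_ratings):
    """Tiny in-memory stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for item, ratings in all_ratings.items():
        for key, value in map_ratings(item, ratings):
            grouped[key].append(value)
    return dict(reduce_item(item, values) for item, values in grouped.items())

# Hypothetical toy data: two films, three users.
toy = {"film_a": {"u1": 5, "u2": 3}, "film_b": {"u1": 2, "u2": 2, "u3": 5}}
se = run(toy)
```

With the toy data, `film_a` has mean 4 and squared deviations 1 and 1, and `film_b` has mean 3 and squared deviations 1, 1 and 4.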

Edit: My original answer was incomplete. Let me explain again.

What you ultimately want is SE(U) for every splitter.

Step a prepares some useful data about items. The emitted entries are defined with:

key = (population_id, item)
value =
    number: |U_i|,
    sum_of_ratings: \sum_{u \in U_i} r_{u,i}
    sum_of_squared_ratings: \sum_{u \in U_i} r_{u,i} ^2
  • The map function explodes the statistics over the items.
  • The reduce function computes the sums.
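A rough sketch of what step a could look like, again in plain Python rather than on a real framework. The raw input shape `(population_id, user, item, rating)` is an assumption on my part (the question only gives the output of this step), and `map_a`, `reduce_a` and `step_a` are illustrative names.

```python
from collections import defaultdict

def map_a(record):
    """Turn one raw rating into partial statistics keyed by (population_id, item)."""
    population_id, _user, item, rating = record
    return ((population_id, item), (1, rating, rating ** 2))

def reduce_a(key, values):
    """Add the partial (count, sum, sum of squares) triples component-wise."""
    counts, sums, squares = zip(*values)
    return key, (sum(counts), sum(sums), sum(squares))

def step_a(records):
    """In-memory stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        key, value = map_a(record)
        grouped[key].append(value)
    return dict(reduce_a(key, values) for key, values in grouped.items())

# Hypothetical toy input: (population_id, user, item, rating)
records = [
    ("p0", "u1", "film_a", 5),
    ("p0", "u2", "film_a", 3),
    ("p0", "u1", "film_b", 2),
]
stats = step_a(records)
```

Because the three statistics are plain sums, the same `reduce_a` could also serve as a combiner.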

Now, for any given splitter movie M:

U_M = U_{+M} + U_{-M} + U_{M?}

Step b explicitly computes, for each splitter M, the statistics for the small sub-populations M+ and M-.

NB likers/dislikers is not a boolean per se, it is the sub-population indicator '+' or '-'

There are 2 new entries for each splitter item:

key = (population_id, item, M, '+') 
value = 
    number: |U_i(+)|
    sum_of_ratings: \sum_{u \in U_i(+)} r_{u,i}
    sum_of_squared_ratings: \sum_{u \in U_i(+)} r_{u,i} ^2

Same thing for '-'

Or, if you prefer the dis/likers notation:

key = (population_id, item, M, dis/likers) 
value = 
    number: |U_i(dis/likers)|
    sum_of_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i}
    sum_of_squared_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i} ^2

cf Middle of slide 24

NB If you consider that each film might be a splitter, there are 2 x |item|^2 records of the second form, because each item yields one record per (boolean, splitter) pair. That is far fewer than your 2^|item| estimate, which you haven't explained.

Step c computes, for each splitter M, the estimated SE by each movie, i.e. SE(U_M)_i

Because a sum can be split across its different members:

U_M = U_{+M} + U_{-M} + U_{M?}

SE(U_M)_i = SE(U_{M?})_i + SE(U_{+M})_i + SE(U_{-M})_i

with SE(U_{+M}) explicitly computed with this map function:

def map(key, value):
    '''
    key = (population_id, item, M, dis/likers)
    '''
    value =
        count: 1
        dist: (r_{u,i} - r_i)^2

    emit key, value

def reduce(key, values):
    '''
    This function explicitly computes the SE for dis/likers
    key = (population_id, item, M, dis/likers)
    values = list of (count, dist)
    '''
    emit key, (sum(count), sum(dist))

Now all we need is SE(U_{M?})_i, which is a "trivial" computation given in slide 24:

SE(?)_i = \sum_{u \in U_i(?)}{r_{u,i}^2} - (\sum_{u \in U_i(?)}{r_{u,i}})^2 / |U_i(?)|
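In case the identity is not obvious, here is the expansion I have in mind. It assumes the SE of a sub-population is taken around that sub-population's own mean, which I write as \bar{r} (my notation, not the slides'), with n = |U_i(?)|:

```latex
\begin{aligned}
\bar{r} &= \frac{1}{n} \sum_{u \in U_i(?)} r_{u,i} \\
\sum_{u \in U_i(?)} (r_{u,i} - \bar{r})^2
  &= \sum_{u \in U_i(?)} r_{u,i}^2 - 2\bar{r} \sum_{u \in U_i(?)} r_{u,i} + n\bar{r}^2 \\
  &= \sum_{u \in U_i(?)} r_{u,i}^2 - \frac{1}{n} \Big( \sum_{u \in U_i(?)} r_{u,i} \Big)^2
\end{aligned}
```

So the count, the sum of ratings and the sum of squared ratings are enough to get the SE without going back to the individual ratings.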

Of course, we are not going to compute these big sums directly. Instead we use the remark just above in the lecture and the data already computed in step a (that's the conclusion I draw from the last 3 equations of slide 24):

SE(?)_i = \sum_{u \in U_i}{r_{u,i}^2} - \sum_{u \in U_i('+'/'-')}{r_{u,i}^2} - (...) / (|U_i| - |U_i('+'/'-')|)

So this one is not even a Map/Reduce, it is just a finalize step:

def finalize(key, values):
    for [k in keys if k match key]:
        ''' From all entries get
        # from step a
        key = (population_id, item) value=(nb_ratings, sum_ratings, sum_ratings_squared)
        # from step b
        key = (population_id, item, M, '+') value=(nb_ratings_likers, sum_ratings_likers, sum_ratings_squared_likers)
        key = (population_id, item, M, '-') value=(nb_ratings_dislikers, sum_ratings_dislikers, sum_ratings_squared_dislikers)
        # from step c
        key = (population_id, item, M, '+') value=(se_likers)
        key = (population_id, item, M, '-') value=(se_dislikers)
        '''
        nb_other = nb_ratings - nb_ratings_likers - nb_ratings_dislikers
        sum_other = sum_ratings - sum_ratings_likers - sum_ratings_dislikers
        sum_squared_other = sum_ratings_squared - sum_ratings_squared_likers - sum_ratings_squared_dislikers
        se_other = sum_squared_other - sum_other ** 2 / nb_other
        emit
            key: (population_id, splitter, item)
            value : se_likers + se_dislikers + se_other
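Here is a small runnable sketch of that finalize computation for one (population, splitter, item) triple. Note one substitution: for simplicity it derives the likers and dislikers terms from the step b statistics too, instead of reading them from step c's output, and it takes each SE around the group's own mean. `se_from_stats` and this `finalize` signature are names I made up for the sketch.

```python
def se_from_stats(count, total, total_sq):
    """SE around the group's own mean: sum r^2 - (sum r)^2 / n, and 0 for an empty group."""
    return total_sq - total ** 2 / count if count else 0.0

def finalize(all_stats, likers, dislikers):
    """Each argument is a (count, sum_ratings, sum_ratings_squared) triple.

    The statistics of the "?" group are obtained by subtracting the likers and
    dislikers triples from the whole-population triple.
    """
    other = tuple(a - l - d for a, l, d in zip(all_stats, likers, dislikers))
    return se_from_stats(*likers) + se_from_stats(*dislikers) + se_from_stats(*other)

# Hypothetical toy item: likers rated 5 and 4, one disliker rated 1,
# and two users with unknown opinion on the splitter rated 3 and 3.
likers = (2, 9, 41)
dislikers = (1, 1, 1)
all_stats = (5, 16, 60)
se = finalize(all_stats, likers, dislikers)
```

In the toy case only the likers contribute (5 and 4 around a mean of 4.5), since the other two groups have no spread.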

Step d Finally, the last step computes the SE for U_M. It is simply the sum of the previous entries, and a trivial Map/Reduce:

For a splitter M:

SE(U_M) = \sum_i SE(U_M)_i = \sum_i SE(U_{M?})_i + \sum_i SE(U_{+M})_i + \sum_i SE(U_{-M})_i
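As a sketch, step d only has to re-key the previous output by (population_id, splitter) and add things up. The entry shape below follows the output format given in the question; `step_d` is an illustrative name and the grouping is done in memory rather than by a framework.

```python
from collections import defaultdict

def step_d(entries):
    """Sum per-item squared errors into one total per (population_id, splitter).

    `entries` is an iterable of ((population_id, splitter, item), se) pairs.
    """
    totals = defaultdict(float)
    for (population_id, splitter, _item), se in entries:
        totals[(population_id, splitter)] += se
    return dict(totals)

# Hypothetical toy output of the previous step.
entries = [
    (("p0", "M1", "film_a"), 0.5),
    (("p0", "M1", "film_b"), 2.0),
    (("p0", "M2", "film_a"), 1.25),
]
totals = step_d(entries)
```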
rds
  • And of course the actual computation of `(r_{u,i} - r_i)^2` must be done on each node (that's the responsibility of the framework and its results object). Also note that the variables are in LaTeX style, this won't compile in Python. – rds Aug 16 '11 at 15:25
  • please use the terms of the question, the input for this stage is NOT `r_{u,i}`, it is **records** in the given form [two types, mentioned in the question]. also, it seems this solution does not use the hint given by the lecturer [which might be Ok, but still raises a red flag]. – amit Aug 16 '11 at 18:37
  • I think I know where you misunderstood me, the given records are the input of step3, the specific step I'm having troubles with, and not to the whole algorithm. – amit Aug 16 '11 at 18:57
  • @rds: allowed myself to [TeXify](http://texify.com/) your formula for better readability! – Jean-François Corbett Aug 16 '11 at 20:32
  • Merci @Jean-François, I haven't found any reference to LaTeX in the help about the Markdown syntax. Good to know it is supported, even though my answer is not good. – rds Aug 16 '11 at 21:52
  • @rds: first sorry it took me a lot of time to comment, I didn't get a notify for the edit :\ second: your solution is very neat and elegant. my 1st,2nd,4th step look pretty much the same as you did. However, the finalize step is NOT allowed. Can you make some modification that this 'finalize' will be handled within the step? – amit Aug 18 '11 at 18:24
  • @rds: also, you get a +1 from me, and unless a solution that do it with only map reduce show up in the next 20 hours, you will also get the bounty, but I will not accept until I find a solution with only map/reduce steps. nevertheless, your solution is very good, and it was a real pleasure reading it. – amit Aug 18 '11 at 18:26
  • I acknowledge I know very little about MapReduce. But I don't see how you can avoid to emit the rating per film for each user user, because this is the core of the computation of a SE. Also, I know finalize was not in the original paper but some frameworks have added it since (at least [MongoDB](http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-FinalizeFunction), [octo.py](http://code.google.com/p/octopy) What about Hadoop?) – rds Aug 18 '11 at 20:53