
I'm currently using MongoDB's aggregation framework in a Java web application to generate recommendations for users based on the preferences of other users.

One of the main techniques I'm using is array intersection.

Right now my algorithm simply considers two users "similar" if they have a non-zero array intersection.

To build a more accurate algorithm, I want to weigh the size of the set intersection into my aggregation pipeline.

Is there a way to do this?

David K.
  • It would be interesting to know how you are doing that non-zero array intersection. In the aggregation framework? – drmirror Aug 06 '13 at 01:36
  • Are you comparing users one-to-one, or do you need one-to-many? – evilive Aug 06 '13 at 05:40
  • Can you provide some sample documents and what you expect to get back as result? – Derick Aug 06 '13 at 13:53
  • Hey all, thanks for the quick feedback. It's a one-to-many comparison, checking the main user's favorites array against that of every other user. As for the non-zero intersection, I simply $match out the users for which user.favorites $nin main.favorites. And Derick, sure. My input documents are: { user: "David", favorites: [1,2,3] } basically and I want my output to be: { movie_id: 2, score: 12 } where the score is weighted by the number of common movies (read: size of set intersection) between the main user and each other user. Just for clarification, the favorites array is movie_ids. – David K. Aug 06 '13 at 16:40

2 Answers


If I understand your question, you have data something like the following:

db.users.insert({_id: 100, likes: [
    'pina coladas',
    'long walks on the beach',
    'getting caught in the rain'
]})
db.users.insert({_id: 101, likes: [
    'cheese',
    'bowling',
    'pina coladas'
]})
db.users.insert({_id: 102, likes: [
    'pina coladas',
    'long walks on the beach'
]})
db.users.insert({_id: 103, likes: [
    'getting caught in the rain',
    'bowling'
]})
db.users.insert({_id: 104, likes: [
    'pina coladas',
    'long walks on the beach',
    'getting caught in the rain'
]})

and you want to compute, for a given user, how many features ('likes' in this example) they share with each other user. The following aggregation pipeline does this:

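var user = 100
var user_likes = db.users.findOne({_id: user}).likes
var return_only = 2 // number of matches to return

db.users.aggregate([
    // one document per (user, like) pair
    {$unwind: '$likes'},
    // keep only other users' likes that the given user shares
    {$match: {
        $and: [
            {_id: {$ne: user}},
            {likes: {$in: user_likes}}
        ]
    }},
    // count the shared likes per user
    {$group: {_id: '$_id', common: {$sum: 1}}},
    // most similar users first
    {$sort: {common: -1}},
    {$limit: return_only}
])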

Given the example input data above, this outputs the following result, showing the top 2 matches:

{
    "result" : [
        {
            "_id" : 104,
            "common" : 3
        },
        {
            "_id" : 102,
            "common" : 2
        }
    ],
    "ok" : 1
}

Note that I assumed you want only the top few matches, since there may be a very large number of users. The $sort step followed by the $limit step accomplishes this. If you want every matching user instead, just omit the last two steps of the pipeline.

I hope this helps! Let me know if you have further questions.

Bruce

Bruce Lucas

As of MongoDB 2.6, you can use the $size expression.

If you are doing an intersection of two arrays (sets) you will first want to use the $setIntersection operator to find the intersection of the two sets. Another example is given in this question.
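As a rough sketch of that first step, the stage below keeps only the likes each user shares with a reference user. The helper name sharedLikesStage and the output field shared are made up for illustration, and the likes field is borrowed from the sample documents in the other answer. The reference array is wrapped in $literal so its values are not parsed as expressions.

```javascript
// Hypothetical helper (not part of MongoDB): builds a $project stage that
// keeps, for each user, only the likes shared with the reference user.
function sharedLikesStage(userLikes) {
    return {
        $project: {
            shared: {
                // $literal passes the array through without parsing it
                $setIntersection: ['$likes', {$literal: userLikes}]
            }
        }
    };
}

// In the mongo shell, assuming the users collection from the other answer:
// var userLikes = db.users.findOne({_id: 100}).likes;
// db.users.aggregate([
//     {$match: {_id: {$ne: 100}}},
//     sharedLikesStage(userLikes)
// ]);
```

Since $setIntersection treats its inputs as sets, duplicates are dropped and the order of the elements in shared should not be relied on.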

You can then wrap that expression in the new $size operator to get the number of elements in the intersection. This answer provides an example of using the new $size expression.
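Putting the two operators together, one possible way to weight recommendations by intersection size looks like the sketch below. It is only an illustration under assumptions: the helper name recommendationPipeline, the weight and score field names, and the choice to sum weights per item are mine, not something from the question or from MongoDB itself. It also assumes every document has a likes array, because $size fails on a missing or null value.

```javascript
// Hypothetical helper: scores items liked by similar users, weighting each
// user's vote by how many likes that user shares with the reference user.
function recommendationPipeline(userId, userLikes, limit) {
    return [
        // Skip the reference user.
        {$match: {_id: {$ne: userId}}},
        // weight = size of the set intersection with the reference user's likes
        {$project: {
            likes: 1,
            weight: {$size: {$setIntersection: ['$likes', {$literal: userLikes}]}}
        }},
        // Drop users with nothing in common.
        {$match: {weight: {$gt: 0}}},
        // One document per (user, liked item) pair.
        {$unwind: '$likes'},
        // Only recommend items the reference user does not already have.
        {$match: {likes: {$nin: userLikes}}},
        // Sum the weights of every similar user who likes the item.
        {$group: {_id: '$likes', score: {$sum: '$weight'}}},
        {$sort: {score: -1}},
        {$limit: limit}
    ];
}

// In the mongo shell:
// var userLikes = db.users.findOne({_id: 100}).likes;
// db.users.aggregate(recommendationPipeline(100, userLikes, 5));
```

With the sample users from the other answer and user 100 as the reference, 'bowling' should end up with a score of 2 (one shared like each with users 101 and 103) and 'cheese' with a score of 1. Summing the weights is the simplest choice; a different weighting function could be substituted in the $group stage.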

Scott