
I'm building an application that could be likened to a dating application.

I've got some documents with a structure like this:

$ db.profiles.find().pretty()

[
  {
    "_id": 1,
    "firstName": "John",
    "lastName": "Smith",
    "fieldValues": [
      "favouriteColour|red",
      "food|pizza",
      "food|chinese"
    ]
  },
  {
    "_id": 2,
    "firstName": "Sarah",
    "lastName": "Jane",
    "fieldValues": [
      "favouriteColour|blue",
      "food|pizza",
      "food|mexican",
      "pets|yes"
    ]
  },
  {
    "_id": 3,
    "firstName": "Rachel",
    "lastName": "Jones",
    "fieldValues": [
      "food|pizza"
    ]
  }
]

What I'm trying to do is identify profiles that match each other on one or more fieldValues.

So, in the example above, my ideal result would look something like:

<some query>

result:
[
  {
    "_id": "507f1f77bcf86cd799439011",
    "dateCreated": "2013-12-01",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 2,
        "firstName": "Sarah",
        "lastName": "Jane",
        "fieldValues": [
          "favouriteColour|blue",
          "food|pizza",
          "food|mexican",
          "pets|yes"
        ]
      }
    ]
  },
  {
    "_id": "356g1dgk5cf86cd737858595",
    "dateCreated": "2013-12-02",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 3,
        "firstName": "Rachel",
        "lastName": "Jones",
        "fieldValues": [
          "food|pizza"
        ]
      }
    ]
  }
]

I've thought about doing this either as a map reduce, or with the aggregation framework.

Either way, the 'result' would be persisted to a collection (as per the example result above).

My question is: which of the two would be better suited? And where would I start to implement this?

Edit

In a nutshell, the model can't easily be changed.
This isn't like a 'profile' in the traditional sense.

What I'm basically looking to do (in pseudo code) is along the lines of:

foreach profile in db.profiles.find()
  foreach otherProfile in db.profiles.find("_id": {$ne: profile._id})
    if profile.fieldValues matches any otherProfile.fieldValues
      //it's a match!

Obviously that kind of operation is very, very slow!

It may also be worth mentioning that this data is never displayed; it's literally just a string value that's used for 'matching'.

Alex
  • how many entries in your profile collection (approximately)? And do you need the full profile entry output or just which pairs match, or which pairs match and what attributes they match on? – Asya Kamsky May 07 '13 at 22:35
  • btw, why doesn't your designed output include the pairing "sarah jane" and "rachel jones"? – Asya Kamsky May 08 '13 at 22:25
  • It should indeed include "sarah jane" and "rachel jones" due to food|pizza – Alex May 08 '13 at 23:18
  • Full profile entry would be nice (as per my example of an 'ideal result') - don't need to know what fields they matched on - a separate process will handle that later – Alex May 08 '13 at 23:20
  • then just use the loop like the first part of my answer. – Asya Kamsky May 08 '13 at 23:36

2 Answers


MapReduce would run JavaScript in a separate thread and use the code you provide to emit and reduce parts of your documents in order to aggregate on certain fields. You can certainly look at the exercise as aggregating over each "fieldValue". The aggregation framework can do this as well, but it would be much faster, as the aggregation runs inside the server in C++ rather than in a separate JavaScript thread. However, the aggregation framework may need to return more than 16MB of data, in which case you would have to do more complex partitioning of the data set.
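
To give a sense of what that would look like here, a minimal mapReduce sketch (the output collection name is just a placeholder) that emits each fieldValue and reduces the ids of the profiles sharing it might be:

// Sketch only: group profile _ids by fieldValue via mapReduce.
db.profiles.mapReduce(
    function() {
        // map: emit one key per fieldValue, with this profile's _id as the value
        var doc = this;
        doc.fieldValues.forEach(function(fv) {
            emit(fv, { ids: [doc._id] });
        });
    },
    function(key, values) {
        // reduce: merge the id lists of all profiles that emitted this fieldValue
        var merged = { ids: [] };
        values.forEach(function(v) {
            merged.ids = merged.ids.concat(v.ids);
        });
        return merged;
    },
    { out: "fieldValueMatches" }   // placeholder collection name
);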

But it seems like the problem is a lot simpler than this. You just want to find, for each profile, which other profiles share particular attributes with it. Without knowing the size of your dataset or your performance requirements, I'm going to assume that you have an index on fieldValues, so it will be efficient to query on it, and you can get the results you want with this simple loop:

> db.profiles.find().forEach(function(p) {
      // For each profile, find every other profile that shares at least one
      // fieldValue; $gt on _id avoids listing the same pair twice.
      print("Matching profiles for " + tojson(p));
      printjson(
          db.profiles.find(
              { "fieldValues": { "$in": p.fieldValues },
                "_id": { "$gt": p._id } }
          ).toArray()
      );
  });

Output:

Matching profiles for {
    "_id" : 1,
    "firstName" : "John",
    "lastName" : "Smith",
    "fieldValues" : [
        "favouriteColour|red",
        "food|pizza",
        "food|chinese"
    ]
}
[
    {
        "_id" : 2,
        "firstName" : "Sarah",
        "lastName" : "Jane",
        "fieldValues" : [
            "favouriteColour|blue",
            "food|pizza",
            "food|mexican",
            "pets|yes"
        ]
    },
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 2,
    "firstName" : "Sarah",
    "lastName" : "Jane",
    "fieldValues" : [
        "favouriteColour|blue",
        "food|pizza",
        "food|mexican",
        "pets|yes"
    ]
}
[
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 3,
    "firstName" : "Rachel",
    "lastName" : "Jones",
    "fieldValues" : [
        "food|pizza"
    ]
}
[ ]

Obviously you can tweak the query, for example to not exclude already-matched-up profiles (by changing {$gt: p._id} to {$ne: p._id}), among other tweaks. But I'm not sure what additional value you would get from using the aggregation framework or mapreduce, as this is not really aggregating a single collection on one of its fields (judging by the format of the output that you show). If your output format requirements are flexible, it's certainly possible that you could use one of the built-in aggregation options as well.
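
Since you mention persisting the result to a collection, here is a rough sketch (the "matches" collection name is made up, not something from your question) of the same loop writing one document per matched pair instead of printing it:

// Sketch only: store each matched pair, with both full profiles embedded,
// in a hypothetical "matches" collection.
db.profiles.find().forEach(function(p) {
    db.profiles.find(
        { "fieldValues": { "$in": p.fieldValues }, "_id": { "$gt": p._id } }
    ).forEach(function(other) {
        db.matches.insert({
            dateCreated: new Date(),
            profiles: [ p, other ]
        });
    });
});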

I did check to see what this would look like if aggregating around individual fieldValues and it's not bad, it might help you if your output can match this:

> db.profiles.aggregate(
      { $unwind: "$fieldValues" },
      { $group: {
          _id: "$fieldValues",
          matchedProfiles: { $push: { id: "$_id",
                                      name: { $concat: ["$firstName", " ", "$lastName"] } } },
          num: { $sum: 1 }
      } },
      { $match: { num: { $gt: 1 } } }
  );
{
    "result" : [
        {
            "_id" : "food|pizza",
            "matchedProfiles" : [
                {
                    "id" : 1,
                    "name" : "John Smith"
                },
                {
                    "id" : 2,
                    "name" : "Sarah Jane"
                },
                {
                    "id" : 3,
                    "name" : "Rachel Jones"
                }
            ],
            "num" : 3
        }
    ],
    "ok" : 1
}

This basically says: "For each fieldValue ($unwind), group by fieldValue, pushing an array of matching profile _ids and names and counting how many profiles each fieldValue accumulates ($group), and then exclude the ones that only have one profile matching them ($match)."
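
If you do want to keep this grouped output around, one possibility (sketch only, with a made-up collection name) is to loop over the returned result array and insert each group, since in this version of the shell aggregate() returns a document with a "result" field as shown above:

// Sketch only: persist each fieldValue group into a hypothetical
// "fieldValueMatches" collection.
var res = db.profiles.aggregate(
    { $unwind: "$fieldValues" },
    { $group: {
        _id: "$fieldValues",
        matchedProfiles: { $push: { id: "$_id",
                                    name: { $concat: ["$firstName", " ", "$lastName"] } } },
        num: { $sum: 1 }
    } },
    { $match: { num: { $gt: 1 } } }
);
res.result.forEach(function(group) {
    db.fieldValueMatches.insert(group);
});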

Asya Kamsky
  • I think the aggregate one is almost there... I'm not quite sure it's matching as I'd expect though...? I want to get a list of profiles that match together; this seems to output a group of key|value and the profiles that contain it? – Alex May 08 '13 at 22:11
  • yes, that's because it aggregates around each fieldValue. If you want pairwise output you need to go with the loop and generate the output yourself like the first part of my answer shows. – Asya Kamsky May 08 '13 at 22:13
  • That makes sense... I'm playing with that now, but it seems to be matching too many... for example, I changed the .toArray to .toArray().length - and the "Matching profiles for "+tojson(p._id) (so it only outputs the id) - some of the array lengths are > 350... there's only 200 or so "profiles" in my test collection... :-s – Alex May 08 '13 at 22:52
  • yeah, that's not possible - are you sure you didn't change something else? When you do a db.collection.find() you cannot get back more documents than exist in the collection. I assume you mean you changed printjson( ....find().toArray()) to be print(...find().toArray.length)? that's equivalent to just doing xxx.find().count() - no need to convert to array to get the number of matching results. – Asya Kamsky May 08 '13 at 23:35
  • My bad, you're right.... I think this is pretty much where I need it to be so I'll award the bounty. I'll start a separate question for my next, erm, question....! Thank you! – Alex May 09 '13 at 08:02
  • How can I use this from within a driver? (in particular the C# driver) - http://stackoverflow.com/questions/16458758/executing-custom-mongodb-query-with-a-foreach-with-the-c-sharp-driver – Alex May 09 '13 at 09:40
  • Is there a way of doing anything with the result / converting it to a cursor as currently, it just prints it to the console...? – Alex May 09 '13 at 23:11

First, to distinguish between the two: MongoDB's aggregation framework is basically just mapreduce, but more limited, so that it can provide a more straightforward interface. To my knowledge, the aggregation framework cannot do anything more than general mapreduce.

With that in mind, the question then becomes: is your transformation something that can be modeled in the aggregation framework, or do you need to fall back to the more powerful mapreduce?

If I understand what you're trying to do, I think it is feasible with the aggregation framework if you change your schema a bit. Schema design is one of the trickiest things with Mongo, and you need to take a lot of things into consideration when deciding how to structure your data. Despite knowing very little about your application, I'm going to go out on a limb and make a suggestion anyway.

Specifically, I'd suggest changing the way you structure your fieldValues subdocument into something like this:

{
    "_id": 2,
    "firstName": "Sarah",
    "lastName": "Jane",
    "likes": {
        "colors": ["blue"],
        "foods": ["pizza", "mexican"],
        "pets": true
    }
}

That is, store the multi-valued attributes in an array. This would allow you to take advantage of the aggregation framework's $unwind operator. (See the example in the Mongo documentation.) But, depending on what you're trying to accomplish, this may or may not be appropriate.
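
As a rough sketch (assuming the restructured schema above), unwinding one of those arrays and grouping on it might look like:

// Sketch only, against the restructured schema suggested above: group
// profiles that share a food and keep only foods liked by more than one.
db.profiles.aggregate(
    { $unwind: "$likes.foods" },
    { $group: {
        _id: "$likes.foods",
        profiles: { $push: { id: "$_id",
                             name: { $concat: ["$firstName", " ", "$lastName"] } } },
        num: { $sum: 1 }
    } },
    { $match: { num: { $gt: 1 } } }
);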

Taking a step back, though, you may not find it appropriate to use the aggregation framework or Mongo's mapreduce function. Their use has performance implications, and it may not be a good idea to employ them for your application's core business logic. Generally, their intended use seems to be for infrequent or ad-hoc queries simply to gain insight into one's data. So, you may be better off starting with a "real" mapreduce framework. That said, I have heard cases where the aggregation framework is used in a cron job to create core business data on a regular basis.

Doug Paul
  • Unfortunately, I can't edit the schema (this is a simplified version of it for S.O to illustrate what I'm trying to do) - I've added an edit to further explain – Alex May 01 '13 at 08:03
  • Okay. It's not clear to me what sort of answer or advice you're looking for, then. You can do what you said in your edit with the aggregation framework. Are you looking for help putting together that aggregation command? Or do you want a more in-depth analysis on the tradeoffs between using mapreduce and the aggregation framework? – Doug Paul May 02 '13 at 00:30
  • Sorry - "help putting together that aggregation command" is pretty accurate, and have offered a bounty on the question to tempt people into helping! :-) – Alex May 07 '13 at 15:13
  • The aggregation framework is nothing like Map Reduce... the two work completely differently, in different environments and scenarios etc. – Sammaye May 07 '13 at 20:24
  • @DougPaul you are very much mistaken about aggregation framework relationship to mapreduce. – Asya Kamsky May 07 '13 at 22:34
  • @AsyaKamsky: Oh, how so? I thought I had a pretty good understanding of the aggregation framework. At conferences and in a conversation with a MongoDB developer, it's been explained to me as a structured (and thus optimizable) way of doing the types of things most commonly done in mapreduce queries. So it functions quite differently, but generally serves the same purpose. That's what I tried to say above, and I still can't see how it's wrong. – Doug Paul May 15 '13 at 16:00
  • @Sammaye, I open my above question to you, too. – Doug Paul May 15 '13 at 16:03
  • You should look at the first part of Asya's answer for an extremely brief explanation; of course it doesn't cover the whole topic, but enough to prove my point. – Sammaye May 15 '13 at 17:04
  • @DougPaul "created to handle, in a more efficient fashion, things that previously had to be done with mapReduce" != "basically just mapreduce". It is not at all mapReduce and has zero in common with that code: it runs completely on the server, all written in C++, whereas mapreduce executes the provided JavaScript in a thread separate from the server. – Asya Kamsky May 15 '13 at 22:40
  • you might also check out the discussion/answer here: http://stackoverflow.com/questions/13908438/is-mongodb-aggregation-framework-faster-that-map-reduce/13912126#13912126 – Asya Kamsky May 15 '13 at 22:42
  • @Asya, thanks for following up. I can see how my statements could be misunderstood; perhaps I should have spoken more precisely. (That said, I think it's a gross overstatement to say that I'm "very much mistaken". But whatever.) – Doug Paul Jul 19 '13 at 21:29
  • @Sammaye, I think I see our misunderstanding now. Your statement that "the aggregation framework is nothing like Map Reduce" considers only *how they work*. When I stated that they are basically the same, I was considering only *what they do*: aggregation. So neither statement is broadly correct. Anyway, my answer intends to convey: "If you can use one, you can probably use the other, but prefer the aggregation framework because it's faster. Even better, avoid the need to use either." – Doug Paul Jul 19 '13 at 21:57