49

I have a collection of documents:

date: Date
users: [
  { user: 1, group: 1 }
  { user: 5, group: 2 }
]

date: Date
users: [
  { user: 1, group: 1 }
  { user: 3, group: 2 }
]

I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.

The best solution I've been able to find is to do:

$where: function() { 
  var ids = [1, 5, 7];
  return this.users.every(function(u) { 
    return ids.indexOf(u.user) !== -1;
  });
}

Unfortunately, this seems to hurt performance is stated in the $where docs:

$where evaluates JavaScript and cannot take advantage of indexes.

How can I improve this query?

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
Wex
  • 15,539
  • 10
  • 64
  • 107

3 Answers3

72

The query you want is this:

db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})

This says find me all documents that don't have elements that are outside of the list 1,5,7.

Asya Kamsky
  • 41,784
  • 5
  • 109
  • 133
  • 3
    p.s. this answer takes 10ms on the sample data set generated from the other "answer" – Asya Kamsky May 13 '14 at 03:24
  • 2
    Fantastic, this does appear to be giving me the same result as the query in my question, and it returns about 10x faster. – Wex May 13 '14 at 04:12
  • `{ "users.user": { "$not": { $nin: [1, 5, 7] } } }` - Would you happen to know why this isn't equivalent? – Wex May 13 '14 at 04:41
  • 5
    the key is $elemMatch which makes the distinction that you want a specific element to satisfy a particular condition, as opposed to the document as a whole to satisfy the condition. because arrays allow "users.user" to have multiple values in a single document, it can be ambiguous whether you mean any element or specific element. As you have it, any element can satisfy the $not one of these and it becomes equivalent to an $in. $elemMatch says single element has to be not one of these, meaning that there now has to be another element that isn't 1,5 or 7. the $not now excludes those _documents_ – Asya Kamsky May 13 '14 at 04:45
  • my pleasure. I'm a big fan of aggregation framework though, so don't discount its value just because it wasn't needed in this particular case :) – Asya Kamsky May 13 '14 at 04:53
  • 5
    Good answer. But worth noting that this will also include docs where `users` is either missing or empty. – JohnnyHK May 13 '14 at 12:29
  • 1
    good point, @JohnnyHK I was assuming that users array always exists and contains some users. To exclude those this query can be "$and"ed with `{"users.user":{$exists:true}}` – Asya Kamsky May 13 '14 at 18:00
  • I don't understand why Mongo makes us bend over backwards to do this. Couldn't they just provide an `$allMatch` operator? – Andy Mar 29 '17 at 23:24
  • Especially since you have to use this pattern any time you want to make a thorough `validator` for documents containing arrays. – Andy Mar 29 '17 at 23:31
  • There is an $all operator it means match all of these elements. You're thinking $only operator and it's on the roadmap. Validated for arrays probably should be something different though. – Asya Kamsky Mar 30 '17 at 12:26
  • @AsyaKamsky I am looking for a way to give 100 bounties to this answer. Thanks. – BetterCallMe Sep 15 '21 at 17:48
16

I don't know about better, but there are a few different ways to approach this, and depending on the version of MongoDB you have available.

Not too sure if this is your intention or not, but the query as shown will match the first document example because as your logic is implemented you are matching the elements within that document's array that must be contained within the sample array.

So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:

db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })

But working with the presumption that your logic is actually intended, at least as per suggestion you can "filter" those results by combining with the $in operator so that there are less documents subject to your $where** condition in evaluated JavaScript:

db.collection.find({
    "users.user": { "$in": [ 1, 5, 7 ] },
    "$where": function() { 
        var ids = [1, 5, 7];
        return this.users.every(function(u) { 
            return ids.indexOf(u.user) !== -1;
        });
    }
})

And you get an index though the actual scanned will be multiplied by the number of elements in the arrays from the matched documents, but still better than without the additional filter.

Or even possibly you consider the logical abstraction of the $and operator used in combination with $or and possibly the $size operator depending on your actual array conditions:

db.collection.find({
    "$or": [
        { "users.user": { "$all": [ 1, 5, 7 ] } },
        { "users.user": { "$all": [ 1, 5 ] } },
        { "users.user": { "$all": [ 1, 7 ] } },
        { "users": { "$size": 1 }, "users.user": 1 },
        { "users": { "$size": 1 }, "users.user": 5 },
        { "users": { "$size": 1 }, "users.user": 7 }
    ]
})

So this is a generations of all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.

NOTE: Actually a complete fail in this case as this does something entirely different and in fact results in a logical $in


Alternates are with the aggregation framework, your mileage may vary on which is most efficient due to the number of documents in your collection, one approach with MongoDB 2.6 and upwards:

db.problem.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Just keeping the "user" element value
    { "$group": {
        "_id": "$_id",
        "users": { "$push": "$users.user" }
    }},

    // Compare to see if all elements are a member of the desired match
    { "$project": {
        "match": { "$setEquals": [
            { "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
            "$users"
        ]}
    }},

    // Filter out any documents that did not match
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.

As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:

db.collection.aggregate([
    { "$match": { 
        "users.user": { "$in": [ 1,5,7 ] } 
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},
    { "$unwind": "$users" },
    { "$group": {
        "_id": "$_id",
        "users": { "$push": "$users.user" }
    }},
    { "$project": {
        "match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
    }},
    { "$match": { "match": true } },
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:

db.collection.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    // and a note of it's current size
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
        "size": { "$size": "$users" }
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Filter array contents that do not match
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Count the array elements that did match
    { "$group": {
        "_id": "$_id",
        "size": { "$first": "$size" },
        "count": { "$sum": 1 }
    }},

    // Compare the original size to the matched count
    { "$project": { 
        "match": { "$eq": [ "$size", "$count" ] } 
    }},

    // Filter out documents that were not the same
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

Which of course can still be done, though a little more long winded in versions prior to 2.6:

db.collection.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Group it back to get it's original size
    { "$group": { 
        "_id": "$_id",
        "users": { "$push": "$users" },
        "size": { "$sum": 1 }
    }},

    // Unwind the array copy again
    { "$unwind": "$users" },

    // Filter array contents that do not match
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Count the array elements that did match
    { "$group": {
        "_id": "$_id",
        "size": { "$first": "$size" },
        "count": { "$sum": 1 }
    }},

    // Compare the original size to the matched count
    { "$project": { 
        "match": { "$eq": [ "$size", "$count" ] } 
    }},

    // Filter out documents that were not the same
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

That generally rounds out the different ways, try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:

db.collection.ensureIndex({ "users.user": 1 })

Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.


Verdict

I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:

var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
    var elements = Math.floor(Math.random(10)*10)+1;

    var obj = { date: new Date(), users: [] };
    for ( var x = 0; x < elements; x++ ) {
        var user = Math.floor(Math.random(10)*10)+1,
            group = Math.floor(Math.random(10)*10)+1;

        obj.users.push({ user: user, group: group });
    }

    batch.push( obj );

    if ( n % 500 == 0 ) {
        db.problem.insert( batch );
        batch = [];
    }

} 

With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..0, I came to a match count of 430 documents (reduced from 7749 from the $in match ) with the following results (avg):

  • JavaScript with $in clause: 420ms
  • Aggregate with $size : 395ms
  • Aggregate with group array count : 650ms
  • Aggregate with two set operators : 275ms
  • Aggregate with $setIsSubset : 250ms

Noting that over the samples done all but the last two had a peak variance of approximately 100ms faster, and the last two both exhibited 220ms response. The largest variations were in the JavaScript query which also exhibited results 100ms slower.

But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.

So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.

This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
  • 1
    Thanks for pointing me in the direction of aggregation. Looking at the docs, it seems like [setIsSubset](http://docs.mongodb.org/manual/reference/operator/aggregation/setIsSubset/#exp._S_setIsSubset) would also be appropriate. I'm going to see how these perform against what I already have. – Wex May 12 '14 at 15:38
  • @Wex right you are, as that would be equivalent to the two set operations as were used in the example. Honestly missed that by being too focused on pre 2.6 examples, but worth the addition of it's own example as well. Having not run something like this against sizable data I'm not too sure how the performance varies. But I still do have the suspicion that either of the first two forms without the aggregation method would be the most performant options. – Neil Lunn May 12 '14 at 22:48
  • @Wex Actually quite intrigued on what your results might be with real world data. I went back to this with a test case where the results were quite intriguing. – Neil Lunn May 13 '14 at 02:40
  • 5
    @AsyaKamsky Well you are right that despite the negation of an index it would be the better solution. But there was no need to be as rude as you have in response. – Neil Lunn May 13 '14 at 04:23
1

I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.

Let's say you expanded your question from userIds to full users. You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]

This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "Not in array" for arrays of objects. And using $where would slow down the query too much.

Solution:

db.collection.find({
 "users": {
   "$not": {
     "$elemMatch": {
       // if all of the OR-blocks are true, element is not in array
       "$and": [{
         // each OR-block == true if element != that user
         "$or": [
           "user": { "ne": 1 },
           "group": { "ne": 3 }
         ]
       }, {
         "$or": [
           "user": { "ne": 2 },
           "group": { "ne": 5 }
         ]
       }, {
         // more users...
       }]
     }
   }
 }
})

To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.

Mark Bryk
  • 31
  • 3