Get distinct records with specified fields that match a value, paginated

Question

I'm trying to get all documents in my MongoDB collection

by distinct customer ids (custID)
where status code == 200
paginated (skipped and limit)
return specified fields

var Order = mongoose.model('Order', orderSchema());

My original thought was to use mongoose db query, but you can't use distinct with skip and limit as Distinct is a method that returns an "array", and therefore you cannot modify something that is not a "Cursor":

    Order
        .distinct('request.headers.custID')
        .where('response.status.code').equals(200)
        .limit(limit)
        .skip(skip)
        .exec(function (err, orders) {      
            callback({
                data: orders
            });
        });

So then I thought to use Aggregate, using $group to get distinct customerID records, $match to return all unique customerID records that have status code of 200, and $project to include the fields that I want:

    Order.aggregate(
        [
            {
                "$project" :
                {
                    'request.headers.custID' : 1,
                    //other fields to include
                }
            },
            {
                "$match" :
                {
                    "response.status.code" : 200
                }
            },
            {
                "$group": {
                    "_id": "$request.headers.custID"
                }
            },
            {
                "$skip": skip
            },
            {
                "$limit": limit
            }
        ],
        function (err, order) {}
    );

This returns an empty array though. If I remove project, only $request.headers.custID field is returned when in fact I need more.

Any thoughts?

score 4 · Accepted Answer · answered Dec 15 '15 at 02:23

The thing you need to understand about aggregation pipelines is generally the word "pipeline" means that each stage only receives the input that is emitted by the preceeding stage in order of execution. The best analog to think of here is "unix pipe" |, where the output of one command is "piped" to the other:

ps aux | grep mongo | tee out.txt

So aggregation pipelines work in much the same way as that, where the other main thing to consider is both $project and $group stages operate on only emitting those fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.

Other things to get used to are stages like $match are more important to place at the beginning of a pipeline than field selection. The primary reason for this is possible index selection and usage, which speeds things up immensely. Also, field selection of $project followed by $group is somewhat redundant, as both essentially select fields anyway, and are usually best combined where appropriate anyway.

Hence most optimially you do:

Order.aggregate(
    [
        { "$match" : {
            "response.status.code" : 200
        }},
        { "$group": {
           "_id": "$request.headers.custID",  // the grouping key
           "otherField": { "$first": "$otherField" },
           // and so on for each field to select
        }},
        { "$skip": skip },
        { "$limit": limit }
    ],
    function (err, order) {}
);

Where the main thing here to remember about $group is that all other fields than _id ( which is the grouping key ) require the use of an accumulator to select, since there is in fact always a multiple occurance of the values for the grouping key.

In this case we are using $first as an accumulator, which will take the first occurance from the grouping boundary. Commonly this is used following a $sort, but does not need to be so, just as long as you understand the behavior of what is selected.

Other accumulators like $max simply take the largest value of the field from within the values inside the grouping key, and are therefore independant of the "current record/document" unlike $first or $last. So it all depends on your needs.

Of course you can shorcut the selection in modern MongoDB releases after MongoDB 2.6 with the $$ROOT variable:

Order.aggregate(
    [
        { "$match" : {
            "response.status.code" : 200
        }},
        { "$group": {
           "_id": "$request.headers.custID",  // the grouping key
           "document": { "$first": "$$ROOT" }
        }},
        { "$skip": skip },
        { "$limit": limit }
    ],
    function (err, order) {}
);

Which would take a copy of all fields in the document and place them under the named key ( which is "document" in this case ). It's a shorter way to notate, but of course the resulting document has a different structure, being now all under the one key as sub-fields.

But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.

super helpful Blakes. +1 for explanation too! – user3871 Dec 15 '15 at 16:08 — user3871, Dec 15 '15 at 16:08

Get distinct records with specified fields that match a value, paginated

1 Answers1