
The MongoDB site, at https://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/, offers this example of a document for faster lookups at the daily and monthly level.

{
    _id: "201010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-00T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif" },
    daily: {
        "1": {"sessions": 300, "bounces": 10}
        "2": {"sessions": 100, "bounces": 5},
        "3": {"sessions": 10},
        "4": {"sessions": 100, "bounces": 4},
        ... }
}

For example, to retrieve data for a specific day:

   db.stats.monthly.find_one({ }, { 'daily.1': 1, 'metadata': 1 })

The above schema works great for me for most of the documented use cases, since they are essentially just lookups.

The few cases where we might have a custom date range are what I'm struggling with. So if the user searches for 1st Jan - 3rd Jan, I'm ideally expecting this result:

[{
        _id: "201010/site-1/apache_pb.gif",
        metadata: {
            date: ISODate("2000-10-00T00:00:00Z"),
            site: "site-1",
            page: "/apache_pb.gif" },
        result: {
           "sessions": 410, "bounces": 15 }
    }, {
        _id: "201010/site-1/apache_new.gif",
        metadata: {
            date: ISODate("2000-05-00T00:00:00Z"),
            site: "site-1",
            page: "/apache_new.gif" },
        result: {
           "sessions": 310, "bounces": 8 }
    },
    ...
]

I understand that we need to do aggregation here, but I'm totally confused as to whether it is even possible to aggregate by giving a range of keys in the embedded object.

Will I have to restructure my schema for this to be possible? I really love the efficient lookups, and they serve 80-90% of our use cases.

Pratik Bothra

1 Answer


Your current data storage format does not play well with the aggregation framework, or indeed with MongoDB queries in general. The core problem is that your "daily" object just contains named keys for each item, which means that in order to access anything, MongoDB needs to be given the specific path to that key, such as "daily.1", just as you mention.
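
For example, even just projecting days 1 through 3 under this layout means spelling out every key path by hand, since there is no way to express a range of keys (a hypothetical shell query, reusing the sample _id from your document):

db.stats.monthly.findOne(
    { "_id": "201010/site-1/apache_pb.gif" },
    { "metadata": 1, "daily.1": 1, "daily.2": 1, "daily.3": 1 }
)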

As stated, the aggregation framework and general MongoDB operations cannot "traverse the keys of an object", so you would need server side JavaScript in order to collect data from all the keys.

An approach that is more in line with optimised MongoDB capabilities is to store the "daily" data in an array instead:

{
    _id: "201010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-00T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif"
    },
    daily: [
        { "day": 1, "sessions": 300, "bounces": 10},
        { "day": 2, "sessions": 100, "bounces": 5},
        { "day": 3, "sessions": 10},
        { "day": 4, "sessions": 100, "bounces": 4}
    ]
}
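
Lookups for a specific day remain cheap with this layout; a minimal sketch (assuming the shell and the sample _id above) is to match the element and project it back with the positional $ operator:

db.collection.findOne(
    { "_id": "201010/site-1/apache_pb.gif", "daily.day": 1 },
    { "metadata": 1, "daily.$": 1 }
)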

Then you can run an aggregation over the content quite simply:

db.collection.aggregate([
    // Match relevant objects 
    { "$match": {
        "daily": { 
            "$elemMatch": { 
                "day": { "$gte": 1, "$lte": 3 } 
            }
        }
    }},

    // Unwind to denormalize array
    { "$unwind": "$daily" },

    // Filter the required results
    { "$match": {
        "daily.day": { "$gte": 1, "$lte": 3 }
    }},
    // Group data and sum totals
    { "$group": {
        "_id": "$_id",
        "metadata": { "$first": "$metadata" },
        "resultSessions": { "$sum": "$daily.sessions" },
        "resultBounces": { "$sum": "$daily.bounces" }
    }},

    // Optionally project to desired format
    { "$project": {
        "metadata": 1,
        "result": {
            "sessions": "$resultSessions",
            "bounces": "$resultBounces"
        }
    }}
])

Or better yet, pre-filter the array before unwinding:

db.collection.aggregate([
    { "$match": {
        "daily": { 
            "$elemMatch": { 
                "day": { "$gte": 1, "$lte": 3 } 
            }
        }
    }},
    { "$project": {
        "metadata": 1,
        "daily": {
            "$setDifference": [
                { "$map": {
                    "input": "$daily",
                    "as": "day",
                    "in": {
                        "$cond": [
                            { "$and": [
                                { "$gte": [ "$day.day", 1 ] },
                                { "$lte": [ "$day.day", 3 ] }
                            ]},
                            "$day",
                            false
                        ]
                    }
                }},
                [false]
            ]
        }
    }},
    { "$unwind": "$daily" },
    { "$group": {
        "_id": "$_id",
        "metadata": { "$first": "$metadata" },
        "resultSessions": { "$sum": "$daily.sessions" },
        "resultBounces": { "$sum": "$daily.bounces" }
    }},
    { "$project": {
        "metadata": 1,
        "result": {
            "sessions": "$resultSessions",
            "bounces": "$resultBounces"
        }
    }}
])
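
As a side note, if you happen to be on MongoDB 3.2 or later, the $filter operator expresses the same array pre-filtering more directly than the $map and $setDifference combination above; a sketch under that assumption:

db.collection.aggregate([
    // Match relevant objects
    { "$match": {
        "daily": {
            "$elemMatch": {
                "day": { "$gte": 1, "$lte": 3 }
            }
        }
    }},
    // Keep only the array elements in the requested range
    { "$project": {
        "metadata": 1,
        "daily": {
            "$filter": {
                "input": "$daily",
                "as": "day",
                "cond": {
                    "$and": [
                        { "$gte": [ "$$day.day", 1 ] },
                        { "$lte": [ "$$day.day", 3 ] }
                    ]
                }
            }
        }
    }},
    { "$unwind": "$daily" },
    { "$group": {
        "_id": "$_id",
        "metadata": { "$first": "$metadata" },
        "resultSessions": { "$sum": "$daily.sessions" },
        "resultBounces": { "$sum": "$daily.bounces" }
    }},
    { "$project": {
        "metadata": 1,
        "result": {
            "sessions": "$resultSessions",
            "bounces": "$resultBounces"
        }
    }}
])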

And please always $match relevant objects first to reduce what is being processed.

Since the properties in the data now all share the same paths and are not tied to an outer key, they can be easily accumulated.

Without this structural change, the only way to aggregate on the server is to use mapReduce, which can use a coded function to iterate the object keys:

db.collection.mapReduce(
    function() {
        // keep a reference to the document, since "this" is not bound
        // inside the .filter() and .forEach() callbacks
        var self = this,
            result = { "sessions": 0, "bounces": 0 };
        Object.keys(this.daily)
            .filter(function(key) {
               return ( key >= 1 && key <= 3 );
            })
            .forEach(function(key) {
                result.sessions += self.daily[key].sessions || 0;
                result.bounces += self.daily[key].bounces || 0;
            });
        emit(this._id,{ metadata: this.metadata, result: result });
    },
    function() {},  // won't be called for unique _id values
    {
        "out": { "inline": 1 },
        "query": {
            "daily": {
                "$elemMatch": {
                    "day": { "$gte": 1, "$lte": 3 }
                }
            }
        }
    }
)

Of course, in both cases you would adjust the grouping depending on whether you intend to accumulate across documents or not.

And of course, if you are not actually accumulating across documents at all, then just do the same type of key traversal in your own client code when receiving the results.
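
For instance, a minimal client-side sketch in shell JavaScript (the collection name and _id are illustrative), keeping your original key-per-day layout and summing the range after fetching the document:

var doc = db.stats.monthly.findOne(
    { "_id": "201010/site-1/apache_pb.gif" },
    { "metadata": 1, "daily": 1 }
);

var result = { "sessions": 0, "bounces": 0 };
Object.keys(doc.daily)
    // relational comparison coerces the string keys to numbers
    .filter(function(key) { return ( key >= 1 && key <= 3 ); })
    .forEach(function(key) {
        result.sessions += doc.daily[key].sessions || 0;
        result.bounces += doc.daily[key].bounces || 0;
    });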

Blakes Seven
  • Thanks Blakes Seven...Kinda shaken by the different attacks, right now, but I will get back to you in some time. Btw the only reason we're using mongo is to precompute and store stuff there, so that we have efficient lookups. As 95% of our queries, are for specific days, months, weeks - would it be better to go for the initial approach, and just do the map-reduce for the off cases? I'm assuming using the aggregation always is gonna come at a cost, or is the difference in time not going to be noticeable.... – Pratik Bothra Nov 14 '15 at 03:15
  • It really depends on what you are doing. With the alteration it is not a problem to return a "specific" singular match. All you need to do is match the `"daily.day"` property in the query and use the [positional `$`](https://docs.mongodb.org/manual/reference/operator/projection/positional/) operator to return that matched element. The other case as mentioned is that if this is always only a single document response ( or filtered from each document only ) then just return the data and cycle the filtering in the client code instead. – Blakes Seven Nov 14 '15 at 03:34
  • One big disadvantage in perhaps both schema design is the update queries....One is update for new days which is to be pushed in that array, the other is updating data already there for a day (like sessions was revised to 105 for the 2nd day) – Pratik Bothra Nov 14 '15 at 05:33
  • The update boils down to first $set, if that fails then $push => http://stackoverflow.com/questions/23470658/mongodb-upsert-sub-document...Thoughts ? – Pratik Bothra Nov 14 '15 at 05:35
  • @PratikBothra Yes it does. But of course there are [Bulk Operations](https://docs.mongodb.org/manual/reference/method/Bulk/) that allow this to be a single request and response, and that of course is another question entirely. – Blakes Seven Nov 14 '15 at 05:38
  • Hey, we finally got round to implementing it in production. But are facing a big blocker, http://stackoverflow.com/questions/35992817/conditionally-project-in-aggregation-with-array-filtering-without-elemmatch. Even the set difference etc doesn't work for me. – Pratik Bothra Mar 14 '16 at 16:44
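
Regarding the update question raised in the comments, a minimal sketch of that two-step approach against the array-based schema (try to $set the existing element for the day first, and $push a new element only if nothing matched); the collection name and values are illustrative:

// revise sessions for day 2 if that array element already exists
var res = db.stats.monthly.update(
    { "_id": "201010/site-1/apache_pb.gif", "daily.day": 2 },
    { "$set": { "daily.$.sessions": 105 } }
);

// no element for that day yet, so append a new one
if ( res.nMatched === 0 ) {
    db.stats.monthly.update(
        { "_id": "201010/site-1/apache_pb.gif" },
        { "$push": { "daily": { "day": 2, "sessions": 105 } } }
    );
}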