3

I'm using mongodb, and have a model that adds comments as embedded documents.

How do I get the average age of comments for an entry? (relative example, my fields vary a little)

So I can have many comments for an entry, and I need to find out the average age of a comment, or the average :cal_date. Additional metrics would be great to gather like the max :cal_date for all entries/comments or per entry...

Does this make sense? Need more detail? I'm happy to oblige to get the solution. I've been confused with date calculations for a while now.

Another way to think of this is using the library book model: There are many books and each book has many checkouts/ins. I need to find the average time that each book is checked out and the average time that all books are out. Again, just metrics, but the fact that these are all dates is confusing.

{
  _id: ObjectId("51b0d94c3f72fb89c9000014"),
  barcode: "H-131887",
  comments: [
    {
      _id: ObjectId("51b0d94c3f72fb89c9000015"),
      cal_date: ISODate("2013-07-03T16:04:57.893Z"),
      cal_date_due: ISODate("2013-07-03T16:04:57.894Z")
    },
    {
      _id: ObjectId("51b0e6053f72fbb27900001b"),
      cal_date: ISODate("2012-07-03T19:39:43.074Z"),
      cal_date_due: ISODate("2013-07-03T19:39:43.076Z"),
      updated_at: ISODate("2013-06-06T19:41:57.770Z"),
      created_at: ISODate("2013-06-06T19:41:57.770Z")
    }
  ],
  created_at: ISODate("2013-06-06T18:47:40.481Z"),
  creator_id: ObjectId("5170547c791e4b1a16000001"),
  description: "",
  maker: "MITUTOYO",
  model: "2046S",
  serial: "QEL228",
  status: "Out",
  updated_at: ISODate("2013-06-07T18:54:38.340Z")
}

One more thing How do I include additional fields in my output using $push? I can get this to work, but it includes, say barcode, twice in an array "barcode" => ["H-131887", "H-131887"]

Kevin Brown
  • 12,602
  • 34
  • 95
  • 155
  • adding the books stuff just confuses things - you have comments array and you want to know average "age" of all comments per blog post - seems clear enough. – Asya Kamsky Jun 13 '13 at 03:55
  • can you clarify for me which part you are having trouble with: running aggregation framework from Ruby or figuring out correct aggregation framework pipeline syntax in Ruby? – Asya Kamsky Jun 15 '13 at 11:43
  • also in your example, which field is the "check-out" date? Is it "comments.cal_date"? And which is "check-in" - cal_date_due may be when the book is due but if that's in the past is it set to returned date? You need to explain what the different fields in the sample document represent as the specific dates you have in them don't make any sense to me without more information. – Asya Kamsky Jun 15 '13 at 13:29
  • map/reduce is not necessary as you haven't mentioned anything that can't be done with aggregation framework and aggregation framework is significantly faster (and simpler) than map/reduce. – Asya Kamsky Jun 15 '13 at 14:34
  • if you want to do $push but eliminate duplicates you would use $addToSet. But if you are just trying to restore the original barcode value, rather than creating an array I recommend using $first expression. I'll update the example solution showing this. – Asya Kamsky Jun 19 '13 at 16:40

1 Answers1

1

You didn't say what time units you want the age in, but I'm just going to show you how to get it back in minutes and trust you can work out how to convert that to any other time grain. I'm going to assume original documents have schema like this:

{ _id: xxx,
  post_id: uniqueId,
  comments: [ { ..., date: ISODate() }, ..., { ... , date: ISODate() } ],
  ...
}

Now the aggregation:

// first you want to define some fixed point in time that you are calculating age from.
// I'm going to use a moment just before "now"
var now = new Date()-1
// unwind the comments array so you can work with individual comments
var unwind = {$unwind:"$comments"};
// calculate a new comment_age value
var project = {$project: {
       post_id:1, 
       comment_age: {
           $divide:[ 
               {$subtract:[now, "$comments.date"]},
               60000
           ]
       }
} };
// group back by post_id calculating average age of comments
var group = {$group: {
               _id: "$post_id",
               age: {$avg: "$comment_age"}
            } };
// now do the aggregation:

db.coll.aggregate( unwind, project, group )

You can use $max, $min, and other grouping function to find oldest and newest comment date or lowest/highest comment age. You can group by post_id or you can group by constant to find these calculations for the entire collection, etc.

* edit * Using the document you included for "library book" as example, this might be the pipeline to calculate for each book that's currently "Out" how long it's been out for, assuming that "comments.cal_date" is when it was checked out and that latest cal_date of all the comments represents the current "check-out" (the older ones having been returned):

 db.coll.aggregate( [
    { $match  : { status : "Out"  } },
    { $unwind : "$comments" },
    { $group  : { _id : "$_id", 
                  cal_date : { $max : "$comments.cal_date" } 
                } 
    },
    { $project : { outDuration : { $divide : [ 
                                     { $subtract : [ 
                                                     ISODate("2013-07-15"), 
                                                     "$cal_date" 
                                                   ] 
                                     },
                                     24*60*60*1000 
                                    ] 
                                  }
                  } 
    },
    { $group : { _id : 1, 
                 avgOut : { $avg : "$outDuration" } 
               } 
    } 
 ] )

What the steps are doing:

  • filtering out documents based on status to make calculation about books that are currently Out only.
  • $unwind to flatten out the "comments" array so that we can
  • find which entry is the latest cal_date with $group and $max.
  • use this max cal_date (which represents when the book was checked out) to subtract it from today's date and divide the result by number of milliseconds in a day to get number of days this book has been out
  • $group all the results together to find the average number of days all the checked-out books have been out.

* edit * I was assuming you knew Ruby and just needed to know how to do an aggregation framework command to calculate date differences/averages/etc. Here is the same code in Ruby using "now" to compare cal_date to (you can also do it using a constant date value:

# get db collection from MongoClient into variable 'coll'
# see basic MongoDB Ruby driver tutorial for details
coll.aggregate([ 
   { "$match"  => {"status"=>"Out"} }, 
   { "$unwind" => "$comments"}, 
   { "$group"  => { "_id" => "$_id", "cal_date" => { "$max" => "$comments.cal_date" } } },
   { "$project"=> {
                    "outDuration" => { 
                       "$divide" => [ 
                            {"$subtract" => [ Time.now, "$cal_date" ] }, 
                            24*60*60*1000
                       ]
                    }
                  }
   },
   { "$group"  => {
          "_id"    => 1,
          "avgOut" => {"$avg"=>"$outDuration"}
     }
   }  
])

See https://github.com/mongodb/mongo-ruby-driver/wiki/Aggregation-Framework-Examples for more examples and explanations.

If there are additional fields that you want to preserve in your $group phase you can add more fields by changing the pipeline step like this:

    { $group  : { _id : "$_id", 
                  barcode  : { $first : "$barcode" },
                  cal_date : { $max : "$comments.cal_date" } 
                } 
    } 

If you don't need the original _id you can just use "$barcode" instead of "$_id" in the first line (that is _id: "$barcode") but since there may be multiple fields you want to preserve, $first trick works with as many of them as you want to keep.

Asya Kamsky
  • 41,784
  • 5
  • 109
  • 133
  • if you want to read more about using dates with aggregation, check out a few articles on my blog: http://www.kamsky.org/stupid-tricks-with-mongodb.html – Asya Kamsky Jun 13 '13 at 04:12
  • I'm trying to work with this, Asya. Can you try to use my structure a little more closely? Maybe that will help me debug a little better. This syntax is still foreign to me. – Kevin Brown Jun 13 '13 at 22:18
  • You provided some sort of library book example - can you actually provide a sample document you are working with? Which part of the syntax are you having trouble with? – Asya Kamsky Jun 15 '13 at 01:28
  • That's the document I'm working with. It's a little tricky working with your example in rails directly because your example is shell input javascript and I need to do this all in my controller, I think. – Kevin Brown Jun 15 '13 at 03:48
  • My example only uses shell JavaScript to create date representing current time - the rest of syntax is the aggregation framework. You can skip all the assignments and just call aggregate with pipeline directly. db.collection.aggregate( [{$unwind:"$comments"}, etc]) – Asya Kamsky Jun 15 '13 at 10:48
  • Asya, how does that help me in my rails project? I don't understand how to use that as an instance variable. – Kevin Brown Jun 15 '13 at 13:49
  • 1
    have you seen this page? https://github.com/ntang/mongo-ruby-driver-wiki/blob/master/Aggregation-Framework-Examples.md - it has some examples of how to construct aggregation framework queries in Ruby - I don't think you made it clear which part you were having a problem with - see my comment on the original question. – Asya Kamsky Jun 15 '13 at 14:02
  • Thanks for your help, Asya. I'm trying to wrap this issue up, but it's not easy to take in at first. You've been very helpful. I've edited the question to include one more small thing...can you assist? – Kevin Brown Jun 19 '13 at 16:36
  • Asya, you've been great. I'll try to clean up my question to be more helpful to others later today. Good answer! – Kevin Brown Jun 19 '13 at 19:12