
I'm trying to count, for each array size, how many records have an array of that size — that is, the distribution of array sizes across all records — so I can build a histogram like this:

          | *
          | *
documents | *         *
          | *  *      *
          |_*__*__*___*__*___
            2  5  6  23  47

               Array Size

So the raw documents look something like this:

{hubs : [{stuff:0, id:6}, {stuff:1, id:7}, .... ]}
{hubs : [{stuff:0, id:6}]}

So far using the aggregation framework and some of the help here I've come up with

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:'$_id', count:{$sum:1}}}, 
                       { $group : {_id:'$count', count:{$sum:1}}},
                       { $sort  : {_id: 1}}])
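To make the pipeline's intent concrete, here is a plain-JavaScript equivalent of what the two `$group` stages compute, run over hypothetical sample documents shaped like the ones above (illustrative only; the real work happens server-side in the aggregation framework):

```javascript
// Two hypothetical documents matching the shape shown earlier.
const docs = [
  { hubs: [{ stuff: 0, id: 6 }, { stuff: 1, id: 7 }] },
  { hubs: [{ stuff: 0, id: 6 }] },
];

// First $group: recover each document's array size
// (what $unwind + {$sum: 1} per _id reconstructs).
const sizes = docs.map(d => d.hubs.length);

// Second $group: count how many documents share each size.
const histogram = {};
for (const s of sizes) {
  histogram[s] = (histogram[s] || 0) + 1;
}
// histogram now maps array size -> number of documents, e.g. {1: 1, 2: 1}
```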

This gives me the results I want, but it's not very fast. I'm wondering if there is a variant that doesn't need two `$group` stages. The syntax below is wrong, but what I'm trying to do is put the count value directly in the first `_id` field:

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:{$count:$hubs}, count:1}},
                       { $sort  : { _id: 1 }}])
asked by Scott, edited by Cristian Lupascu
  • There isn't a way to do this in one pass, because you can't start counting/grouping the results until you finish counting the first grouping. What version of MongoDB are you using? – Asya Kamsky Apr 18 '13 at 20:17
  • by the way, how long is it taking? – Asya Kamsky Apr 18 '13 at 22:16
  • Ok, I suspected that might be the case (nesting not allowed, only chaining). It is taking about 4 seconds, which isn't too bad, but I was hoping for sub-second. I can tweak my data instead of tweaking the query now. – Scott Apr 19 '13 at 13:15
  • 3
    right, one approach to speed this up would be to pre-aggregate sums by doing increments of a counter when you push new elements, though there may be other ones - how many documents do you have going into the pipeline (and is it the full collection or are you doing some sort of a $match first?) – Asya Kamsky Apr 19 '13 at 14:21
  • I'm doing no $match first because it's a reference data rollup. It's only about 60,000 records, but quite a bit of pipeline processing. I agree that adding a count to the record is probably the best option for speeding up the query by removing one of the $group's in the pipeline. – Scott Apr 19 '13 at 17:34
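The pre-aggregation idea from the comments can be sketched in the shell. This is an assumption about the schema, not the original code: the counter field name (`hubCount`) and the document selector (`someId`) are hypothetical placeholders.

```javascript
// Sketch: maintain a size counter alongside the array at write time.
// "hubCount" and "someId" are hypothetical, not from the original schema.
db.sitedata.update(
    { _id: someId },
    { $push: { hubs: { stuff: 2, id: 8 } },   // append the new element
      $inc:  { hubCount: 1 } }                // keep the counter in sync
)

// The histogram then needs only a single $group:
db.sitedata.aggregate([{ $group : { _id: '$hubCount', count: { $sum: 1 } } },
                       { $sort  : { _id: 1 } }])
```

This trades a little write-time bookkeeping for dropping both the `$unwind` and one `$group` from the read path.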

1 Answer


Now that 2.6 is out, the aggregation framework supports a new array operator, $size, which lets you $project the array size without having to unwind and re-group:

db.sitedata.aggregate([{ $project:{ 'count': { '$size':'$hubs'} } }, 
                       { $group : {_id:'$count', count:{$sum:1} } },
                       { $sort  : { _id: 1 } } ] )
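One caveat worth noting: `$size` raises an error on documents where `hubs` is missing or not an array. If the collection isn't uniform, a guarded variant (a sketch, assuming missing should count as empty) substitutes an empty array via `$ifNull`:

```javascript
// Guarded version: treat a missing hubs field as an empty array,
// since $size errors out on non-array input.
db.sitedata.aggregate([{ $project: { count: { $size: { $ifNull: ['$hubs', []] } } } },
                       { $group  : { _id: '$count', count: { $sum: 1 } } },
                       { $sort   : { _id: 1 } } ] )
```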
answered by Asya Kamsky