
I have a mongodb of about 400gb. The documents contain a variety of fields, but the key here is an array of IDs.

So a JSON document might look like this:

{
 "name":"bob",
 "dob":"1/1/2011",
 "key":
      [
       "1020123123",
       "1234123222",
       "5021297723"
      ]
}

The focal variable here is "key". There are about 10 billion total keys across 50 million documents (so each document has about 200 keys). Keys can repeat, and there are about 15 million UNIQUE keys.

What I would like to do is return the 10,000 most common keys. I thought aggregate might do this, but I'm having a lot of trouble getting it to run. Here is my code:

db.users.aggregate( 
 [ 
  { $unwind : "$key" }, 
  { $group : { _id : "$key", number : { $sum : 1 } } },
  { $sort : { number : -1 } }, 
  { $limit : 10000 }
 ] 
);

Any ideas what I'm doing wrong?

AlexKogan

1 Answer


Try this:

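db.users.aggregate(
 [
  { $unwind : "$key" },                                  // one document per array element
  { $group : { _id : "$key", number : { $sum : 1 } } },  // count occurrences of each key
  { $sort : { number : -1 } },                           // most common keys first
  { $limit : 10000 },
  { $out : "result" }                                    // write the output to the "result" collection
 ],
 {
  allowDiskUse : true,  // let stages spill to disk when they exceed the memory limit
  cursor : {}
 }
);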

Then read the results with db.result.find().
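To make the stages easier to follow, here is a minimal plain-JavaScript sketch of what the pipeline computes, run over a tiny in-memory array rather than a MongoDB collection. The names topKeys and sampleDocs, and the extra sample documents, are made up for illustration; this only mirrors the logic of $unwind, $group, $sort and $limit and says nothing about how the server executes them.

```javascript
// Illustrative only: topKeys and sampleDocs are hypothetical names,
// not part of MongoDB or of the original question.
function topKeys(docs, limit) {
  const counts = new Map();
  for (const doc of docs) {
    // $unwind: treat every element of the "key" array as its own record
    for (const k of doc.key || []) {
      // $group with { $sum : 1 }: count how often each key occurs
      counts.set(k, (counts.get(k) || 0) + 1);
    }
  }
  return Array.from(counts, ([_id, number]) => ({ _id, number }))
    .sort((a, b) => b.number - a.number) // $sort : { number : -1 }
    .slice(0, limit);                    // $limit
}

// Made-up sample data shaped like the document in the question
const sampleDocs = [
  { name: "bob", key: ["1020123123", "1234123222", "5021297723"] },
  { name: "amy", key: ["1020123123", "1234123222"] },
  { name: "cat", key: ["1020123123"] }
];

const top2 = topKeys(sampleDocs, 2);
console.log(top2);
```

The in-memory Map is the part that does not scale to 15 million unique keys inside the server's per-stage memory limit, which is why the real pipeline needs allowDiskUse.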

Wizard
  • I tried this but got this error: "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." Which is weird, since I would assume I did opt in using the second part of your code. Hmm. – AlexKogan Sep 27 '14 at 00:17
  • @AlexKogan, That's really weird. Could you paste your code here? – Wizard Sep 27 '14 at 00:27
  • Here it is! `db.users.aggregate( [ { $unwind : "$likes" }, { $group : { _id : "$likes", number : { $sum : 1 } } }, { $sort : { number : -1 } }, { $limit : 10000 }, { $out:"result"}, ], { allowDiskUse:true, cursor:{} } )` – AlexKogan Sep 27 '14 at 05:59
  • @AlexKogan, The code is definitely correct, even though there is a redundant comma `,` before `]`. It's strange to get that exception message since you have provided `allowDiskUse:true`. The only possibility I can think of is that you are using an older version of the mongo shell (such as V2.4.8) to connect to a V2.6.4 server. But you've said that you are running V2.6.4. So, is it a bug? :) – Wizard Sep 27 '14 at 14:24
  • Very strange! I'm using Robomongo to connect to my remote server for this. I checked the version of MongoDB and it's 2.6.4. Unless Robomongo itself is old? – AlexKogan Sep 27 '14 at 17:52
  • @AlexKogan, It's possible. The server must be V2.6.0+, otherwise it would not return that exception message. I typed `db.c.aggregate` in mongo shell V2.4.8 to check its JS code and found that it only accepts the first parameter if the first parameter is an array. You can test with the newest Robomongo or mongo shell V2.6.4. – Wizard Sep 28 '14 at 00:49
  • I understand what `allowDiskUse: true` does, but what does `cursor: {}` do? Is it necessary? – user1870400 Oct 06 '17 at 21:05
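
On the shell-version explanation in the comments: if an older shell helper drops the second argument, one possible workaround is to send the `aggregate` command directly, so that the options are part of the command document itself. The sketch below only builds that document; buildAggregateCommand is a hypothetical helper name, and whether this resolves the error under an old Robomongo build is an assumption that was not tested in this thread.

```javascript
// Hypothetical helper: builds the raw command document for the server's
// `aggregate` command, with the options inside the command itself.
function buildAggregateCommand(collectionName, pipeline) {
  return {
    aggregate: collectionName, // name of the collection to aggregate
    pipeline: pipeline,        // same stages as in the answer
    allowDiskUse: true,        // opt in to external sort / disk use
    cursor: {}                 // same cursor option as in the answer
  };
}

const cmd = buildAggregateCommand("users", [
  { $unwind: "$key" },
  { $group: { _id: "$key", number: { $sum: 1 } } },
  { $sort: { number: -1 } },
  { $limit: 10000 },
  { $out: "result" }
]);

console.log(JSON.stringify(cmd, null, 1));
// In a shell connected to the server, this would be sent with:
//   db.runCommand(cmd)
```

Upgrading the client, as suggested above, is still the simpler fix if it is available.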