0

Given a LARGE (hundreds of thousands) collection of event documents (see below for example), what is the most performant method to retrieve the first event with an _id greater than (n) ?

Example Document

{ 
  _id: NumberLong(352757), // Uniqueness guaranteed
  type: "BallDropped",
  createdAt: "2014-01-01T00:00:00Z",
  // ... followed by dynamic properties of unknown size
}

Current Implementation

Given a collection of many events, retrieve the first event with an _id greater than 35.

First, retrieve the id of the event using aggregate.

I do this assuming that the projection phase (return just the id) will be more performant than cycling over full documents of unknown size.

db.events.aggregate(
  { $project: { _id: 1 } },
  { $match: { _id: { $gt: NumberLong(35) } } },
  { $sort: { _id: 1 } },
  { $limit: 1 }
)

Then, I call findOne with the returned _id to retrieve that document.

What are your thoughts?

  • Not really sure what you are asking. Your aggregation statement is basically a straight query with `.find()` And you have the sort order wrong. The greatest value should be sorted in reverse order ( ie. -1). What is the actual problem? – Neil Lunn Apr 09 '14 at 15:47
  • Your aggregate query is correct and will work fine, although it can be accomplished using `find()` (as @NeilLunn mentioned), which is simpler. – Anand Jayabalan Apr 09 '14 at 15:57
  • I think the sort order is correct, I want the _first_ (lowest) document with an id greater than n, not the highest document. This is how I was originally doing it: `db.events.find({ _id: { $gt: 1 } }).sort({ _id: 1 }).limit(1)` But I've seen some posts on here about find/sort being slow on very large collections, and map/reduce or aggregates being preferable. What I'm asking is, what is the most performant way to execute this query given the design concerns I'm working with (documents of unknown size, very large collections, possibly millions of documents). – christianbradley Apr 09 '14 at 15:58
  • To play devil's advocate, what would the difference be on a document where the value being selected is not indexed? How do those queries compare? – christianbradley Apr 09 '14 at 15:59
  • @christianbradley, you can use projection in your find() query to get just the `_id`: `db.events.find({ _id: { $gt: NumberLong(35) } }, {_id:1}).sort({ _id: 1 }).limit(1)` and it would be a covered query. Not having an index on the query criteria will significantly impact the performance. For actual numbers, it would be best to run tests against your collection. – Anand Jayabalan Apr 09 '14 at 16:01
  • @AnandJayabalan, will the projection really save me much in terms of performance if I need to query for that document anyway, or can I just toss it? – christianbradley Apr 09 '14 at 16:12
  • If all you want is the `_id`, then yes, it will definitely save you time. If you have a 100 fields in your document and you want to select just 4 or 5 fields, including the projection is a good practice I think. If you are going to select most of the fields in the document, then don't bother with the projection – Anand Jayabalan Apr 09 '14 at 16:29

0 Answers0