I'm considering bundling time-sequence data together in session documents. Inside each session, there would be an array of events. Each event would have a timestamp. I know that I can create a multikey index on the timestamp of those events, but I'm curious what mechanism MongoDB uses to prevent the same document from showing up twice in one query.

To clarify, imagine a collection of sessions with the following documents:

{
  _id: 'A',
  events: [
    {time: '10:00'},
    {time: '15:00'}
  ]
}
{
  _id: 'B',
  events: [
    {time: '12:00'}
  ]
}
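
(If you want to reproduce this in the shell, the sample documents can be inserted like so; `sessions` is the same collection name used for the index below.)

// insert the two sample session documents
db.sessions.insert({_id: 'A', events: [{time: '10:00'}, {time: '15:00'}]})
db.sessions.insert({_id: 'B', events: [{time: '12:00'}]})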

If I add a multikey index with db.sessions.ensureIndex({'events.time' : 1}), I would expect the b-tree of that index to look like this:

'10:00' => 'A'
'12:00' => 'B'
'15:00' => 'A'

If I query the collection with {'events.time': {$gte: '10:00'}}, MongoDB scans the b-tree and returns:

{ "_id" : "A", "events" : [  {  "time" : "10:00" },  {  "time" : "15:00" } ] }
{ "_id" : "B", "events" : [  {  "time" : "12:00" } ] }

How does Mongo prevent document A from showing up a second time as the third result in the cursor? For small index scans, it could just keep track of which documents had already been seen, but what happens if the index is enormous? Is there ever a case where the same document would show up more than once in a single cursor?

My assumption is that it would not: Mongo could look at the document it is scanning and, by inspecting earlier entries in the indexed array, detect that it would already have matched earlier in the scan. However, I cannot find any mention of this behavior in the MongoDB documentation, and it is important to actually know what to expect.

(NOTE: I do know that it is possible for a document to show up in a single query more than once if the document is modified while the cursor is being scanned. That shouldn't pose a problem for queries on time-sequence data where timestamps are never edited. Even if a new event is added to a session during a scan, if Mongo uses something like the detection mechanism I mentioned above, it should be able to omit the moved document from query results.)
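
(Related sketch: if duplicates from moving documents ever did become a concern, the shell of this era offers cursor.snapshot(), which walks the _id index so a moved document cannot be returned twice; the trade-off is that it cannot use the events.time index and is unavailable on sharded collections.)

// isolate the cursor from document moves by traversing the _id index
db.sessions.find({'events.time': {$gte: '10:00'}}).snapshot()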

  • Interesting question, but if the index size is enormous then the query optimizer will prefer a collection scan (BasicCursor), since that will have a lower nscanned. – Asya Kamsky Nov 06 '13 at 04:34
  • Not necessarily. I could be scanning only 1 month of data which could include a very large number of sessions but still far fewer sessions than the total in the collection. In that case, you would definitely want Mongo to use the index. – Will Conant Nov 06 '13 at 19:02

1 Answer

"I cannot find any mention of this behavior in the MongoDB documentation, and it is important to actually know what to expect."

Implementation internals are seldom mentioned in the documentation, and, after all, what you describe is the expected behavior.

There is code to deduplicate a result set, and there are tests to make sure it's working correctly. After all, a multikey index isn't the primary use case for such functionality: if you have an $or clause in your query, the results must be deduplicated as well.
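
As a concrete illustration of the $or case (a sketch against the sample documents from the question), both clauses below match document 'A', yet it comes back exactly once:

// both clauses match 'A'; the result set is still deduplicated
db.sessions.find({$or: [{'events.time': '10:00'}, {'events.time': '15:00'}]})

// returns:
{ "_id" : "A", "events" : [ { "time" : "10:00" }, { "time" : "15:00" } ] }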

Asya Kamsky
  • Oh yeah! I hadn't even thought about the case of an $or clause scanning two different indexes. So it looks like an unordered set of visited disk locations is kept for the lifetime of the cursor. Does this mean that the document could show up a second time if it is resized during the scan? That could make it unsafe to add a new event while crunching a bunch of sessions into a report. – Will Conant Nov 07 '13 at 01:37
  • Yes, if the document moves then it can show up twice - unless you're walking an index, in which case you won't encounter it again. Crunching a bunch of things into a report usually involves "past" data, so something changing in the "now" shouldn't affect it. – Asya Kamsky Nov 09 '13 at 21:52