1

I'm analyzing texts. Those texts have annotations (e.g. "chapter", "scenery", ...). Those annotations are in my MongoDB collection annotations, e.g.

{
  start: 1,
  stop: 10000,
  type: chapter,
  details: {
    number: 1,
    title: "Where it all began"
  }
},
{
  start: 10001,
  stop: 20000,
  type: chapter,
  details: {
    number: 2,
    title: "Lovers"
  }
},
{
  start: 1,
  stop: 5000,
  type: scenery,
  details: {
    descr: "castle"
  }
},
{
  start: 5001,
  stop: 15000,
  type: scenery,
  details: {
    descr: "forest"
  }
}

Challenge 1: For a given position in the text, I'd like find all annotations. For example querying for character 1234 should tell me, that

  • it is within chapter one
  • it takes place in the castle

Challenge 2: I also like to query for ranges. For example querying for characters form 9800 to 10101 should tell me, that it touches chapter 1, chapter 2 and the scenery forest.

Challenge 3: Comparable to challenge 2 I'd like to match only those annotations that are completely covered by the query-range. For example querying for characters form 9800 to 30000 should only return the document chapter 2.

For challenge 1 I tried to simply use $lt and $gt. e.g.:

db.annotations.find({start: {$lt: 1234}, stop: {$gt: 1234}});

But I realized, that only indexes for the key start is used, even if I have a compound index for start and stop. Is there a way to create more adequate indexes for the three problems I mentioned?

I shortly thought of geospatial indexes, but I haven't used them, yet. I also only need a one-dimensional version of it.

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
R_User
  • 10,682
  • 25
  • 79
  • 120

1 Answers1

3

For Challenge 1, the query you are using is appropriate, though you might want to use $lte and $gte to be inclusive.

db.annotations.find({ "start": { "$lt": 1234 }, "stop": { "$gt": 1234 }});

Regarding indexes, the reason it chooses to use the index on start instead of the compound index has to do with the tree structure of compound indexes, which is nicely explained by Rob Moore in this answer. Note that it can still use the compound index if you use hint(), but the query optimiser finds it faster to use the index on start and then weed out the results that don't match the range for the stop clause.

For Challenge 2, you just need to use an explicit $or clause to cover the cases where stop is within the bounds, when start is within the bounds and when start and stop encompass the bounds.

db.annotations.find({
    "$or": [
        { "stop": { "$gte": 9800, "$lte": 10101 }},
        { "start": { "$gte": 9800, "$lte": 10101 }},
        { "start": { "$lt": 9800 }, "stop": { "$gt": 10101 }}
    ]
});

For Challenge 3, you can use a query very similar to the one in Challenge 1, but ensuring that the documents are completely covered by the given bounds.

db.annotations.find({ "start": { "$gte": 9800 }, "stop": { "$lte": 30000 }});
Community
  • 1
  • 1
Juan Carlos Farah
  • 3,829
  • 30
  • 42
  • I could not find an answer by rob moore in the link you posted (wrong link?). hint(): is mongo also using the informatin for `start` AND `stop`? or is it only using the information for the `start: {$lte: ...}`. – R_User Feb 28 '15 at 00:05
  • Sorry. Fixed the link now. I don't know if with `hint()` the optimiser will use both bits of information or just the prefix. You could try profiling both queries and see if there is any difference. – Juan Carlos Farah Feb 28 '15 at 00:20