    const result = await sandbox
      .aggregate([
        {
          $match: {
            B: { $eq: 500 },
            // B: { $lt: 500 } is faster
            // B: { $gt: 500 } is faster
          },
        },
        { $limit: 1000 },
      ])

B is an un-indexed numeric field, with values distributed randomly from 1 to 1000, in a collection of 1.5 million documents.

The queries with $gt and $lt are about 5-10 times faster than the one with $eq. I have tested this repeatedly.

What explains this difference?

1 Answer


The distribution of the data and the (full) size of the documents that match the queries.

I believe the behavior that you are describing is expected given the scenario that was outlined. If anything, I would expect the performance difference to actually be larger than the one that was mentioned.

As there is no index for the database to utilize for this operation, it will be forced to do a collection scan. Now let's consider what that collection scan looks like during execution. With 1.5 million documents and 1,000 distinct (and random) values, we can expect approximately 1,500 documents to contain each individual value. Based on this, we would expect about 1,500 documents in the collection to match the equality query, but about half of the collection (750,000 documents) to match each of the range queries.
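Those back-of-envelope numbers can be checked directly. A quick sketch in plain JavaScript (nothing MongoDB-specific, just the counts from the question):

```javascript
// Expected match counts, assuming B is uniformly distributed over 1,000
// distinct values across 1.5 million documents (as stated in the question).
const numDocs = 1_500_000;
const numValues = 1_000;

// Equality ($eq: 500): one value out of 1,000 matches.
const expectedEqualityMatches = numDocs / numValues;

// Range ($gt: 500 or $lt: 500): roughly 500 of the 1,000 values match.
const expectedRangeMatches = numDocs * (500 / numValues);

console.log(expectedEqualityMatches); // 1500
console.log(expectedRangeMatches);    // 750000
```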

Assuming a perfectly random distribution where each value showed up exactly once per 1,000 documents, the following would happen when the database began scanning the collection for each of the queries:

  • Equality - each document that the database scans has a 0.1% chance (1/1000) of matching the filter. As the query requests a result set of 1,000 documents, we can expect the database to scan result_set_size / chance_of_matching = 1,000 / .001 = 1 million documents (a full two-thirds of the collection) before it completes.
  • Range - each document that the database scans has about a 50% chance (500/1000) of matching the filter. Using the same equation, we now expect the database to scan result_set_size / chance_of_matching = 1,000 / .5 = 2,000 documents before it completes.
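The early-termination effect described above can be simulated in a few lines of JavaScript. This is my own sketch, not MongoDB internals: it feeds random B values in [1, 1000] to each predicate and counts how many "documents" must be examined before 1,000 matches are found (it assumes an unbounded stream of documents rather than the fixed 1.5M collection):

```javascript
// Simulate a collection scan that stops as soon as `limit` matches are found,
// mimicking how $limit lets the scan terminate early.
function docsExaminedUntilLimit(matches, limit) {
  let examined = 0;
  let found = 0;
  while (found < limit) {
    // Each "document" gets a random B in [1, 1000], like the test collection.
    const B = 1 + Math.floor(Math.random() * 1000);
    examined++;
    if (matches(B)) found++;
  }
  return examined;
}

const limit = 1000;
const eqScanned = docsExaminedUntilLimit(B => B === 500, limit);
const gtScanned = docsExaminedUntilLimit(B => B > 500, limit);

console.log(eqScanned); // ~1,000,000 on average
console.log(gtScanned); // ~2,000 on average
```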

That means the database is doing roughly 500 times more work (close to 3 orders of magnitude) to satisfy the equality query than the range queries. This is why my surprise is that there isn't more than a 10x difference in query duration.

Demonstration

I can't create enough documents on mongoplayground to appropriately demonstrate this, so I did the following in my local environment. First I created the collection via:

var bulk = db.foo.initializeUnorderedBulkOp();
for(i = 0; i < 1500000; i++){bulk.insert({_id:i, B: Math.floor(Math.random()*1000)})}
bulk.execute();

This resulted in the following distribution of data:

> db.foo.count({B:500})
1592
> db.foo.count({B:{$gt:500}})
747873
> db.foo.count({B:{$lt:500}})
750535

Then we can use explain() with executionStats to determine the amount of work that the database had to do to execute these queries. For the equality:

> db.foo.aggregate([ { $match: { B: { $eq: 500 }  } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
  nReturned: 1000,
  executionTimeMillis: 332,
  totalKeysExamined: 0,
  totalDocsExamined: 915535,
  ...

And for the range queries:

> db.foo.aggregate([ { $match: { B: { $gt: 500 }  } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
  nReturned: 1000,
  executionTimeMillis: 1,
  totalKeysExamined: 0,
  totalDocsExamined: 2047,
  ...
> db.foo.aggregate([ { $match: { B: { $lt: 500 }  } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
  nReturned: 1000,
  executionTimeMillis: 1,
  totalKeysExamined: 0,
  totalDocsExamined: 1957,

All of these numbers are right in line with what we were expecting based on the earlier musings.

Aside

By the way, this configuration helps demonstrate why indexes are most helpful when retrieving a small subset of results. If you were to add an index on the B field, I would expect both of the following things to happen:

  1. The performance of the equality query to improve substantially.
  2. The performance of the range queries to be worse than when no index is present.

There is a cost associated with traversing an index prior to retrieving the associated data. It is often the case that a query which matches (very roughly) about 20% or more of the collection will perform better by scanning the full collection rather than using an index.
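To illustrate that tradeoff, here is a deliberately simplified, hypothetical cost model (the unit costs are my own assumptions, not MongoDB's planner): a sequential read during a collection scan costs 1 unit, and each index-driven document fetch pays a random-access penalty of 5 units.

```javascript
// Toy cost model for the index-vs-scan tradeoff. All numbers are illustrative
// assumptions, not measurements of MongoDB's actual query planner.
const collectionSize = 1_500_000;
const randomFetchCost = 5; // assumed penalty per index-driven document fetch

function scanCost(selectivity, limit) {
  // A sequential scan stops after examining enough docs to satisfy the limit.
  return Math.min(collectionSize, limit / selectivity);
}

function indexCost(selectivity, limit) {
  // An index walks straight to matches; each fetched match pays the penalty.
  return Math.min(collectionSize * selectivity, limit) * randomFetchCost;
}

const limit = 1000;
console.log(scanCost(0.001, limit), indexCost(0.001, limit)); // equality: 1000000 vs 5000
console.log(scanCost(0.5, limit), indexCost(0.5, limit));     // range: 2000 vs 5000
```

Under this toy model the equality query improves dramatically with the index while the range queries get worse, matching the two predictions above.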

– user20042973
  • This means that `{ $limit: 1000 }` is used during the `$match` stage? But they should be independent? – Bear Bile Farming is Torture Dec 28 '22 at 20:21
  • 1
    I'm not sure I fully understand that question/statement. The database will attempt to reduce the amount of work it has to do whenever it can do so without changing the results requested by the user. It would be wasteful for the database to retrieve _all_ of the matching results just to then immediately throw the excess away due to the `$limit`. So it will do that pruning as early as possible. Regardless, removing the `$limit` would just further exacerbate the problem. This can be observed via the `explain` output when the `$limit` stage is removed from the pipeline. – user20042973 Dec 28 '22 at 20:26
  • 1
    If you were thinking that the `$match` would do a full collection scan for all of the queries then that would explain the question overall. The `$limit` absolutely plays an important role in preventing that hence leading to the difference in performance. If you were to use a value that is not present for the equality predicate then the database would indeed do a full collection scan and should take longer than the check against the value of `500` – user20042973 Dec 28 '22 at 20:31