83

I have over 300k records in one collection in Mongo.

When I run this very simple query:

db.myCollection.find().limit(5);

It takes only a few milliseconds.

But when I use skip in the query:

db.myCollection.find().skip(200000).limit(5)

It won't return anything... it runs for minutes and returns nothing.

How can I make this faster?

Radek Simko

9 Answers

106

One approach to this problem, if you have a large number of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), is to use the key you're sorting on to select the next page of results.

So if you start with

db.myCollection.find().limit(100).sort({ created_date: 1 });

and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient query (assuming you have an index on created_date):

db.myCollection.find({ created_date: { $gt: max_created_date_from_last_result } }).limit(100).sort({ created_date: 1 });
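
Putting the two steps together, a minimal shell sketch (variable and field names as above) could look like this:

// First page, served from the created_date index
var page = db.myCollection.find().sort({ created_date: 1 }).limit(100).toArray();

// Remember the boundary value from the last document of this page
var max_created_date_from_last_result = page[page.length - 1].created_date;

// Next page: seek past the boundary instead of skipping over documents
var nextPage = db.myCollection
    .find({ created_date: { $gt: max_created_date_from_last_result } })
    .sort({ created_date: 1 })
    .limit(100)
    .toArray();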
Russell
  • 4
    This seems really good. Why don't I see more people suggesting this? – steve Jun 27 '12 at 04:05
  • 12
    Well, it's limited in that you can only go one page forward or back at a time, rather than skip to a specific page, but for this limited use case I think it works well. – Russell Jun 27 '12 at 09:59
  • Cool idea. If you don't really care about a sort (other than for this purpose), and don't want to sort or create an index, seems like maybe you could skip a step and take advantage of the ObjectId (_id) field which will be indexed...it's possible object ids get recycled but again that wouldn't matter in this case. – jsh Mar 26 '13 at 20:53
  • 2
    I found this answer that uses a similar mechanism to the one above: http://stackoverflow.com/a/9704204/1015147 It might be helpful. – imekinox Sep 03 '13 at 16:06
  • 15
    This approach should be used with caution, since it will cause results to be omitted if they have the same `created_date` as `max_created_date_from_last_result`. See https://sammaye.wordpress.com/2012/05/25/mongodb-paging-using-ranged-queries-avoiding-skip/ which offers the solution of using a compound index on `_id` and the timestamp. – Pixel Elephant May 19 '15 at 17:56
  • 4
    This is a reasonable approach, but not perfect as you can't jump pages. One limitation I see: what if the records are sorted by name, e.g. `product.name`? I am really surprised to see no real support for `skip` in MongoDB; this is a deal breaker for me. – GETah May 12 '17 at 05:21
  • Instead of created_at, how about using the ObjectId of the last document returned? I think that would be much better – Joel Joseph Jul 17 '20 at 10:13
88

From MongoDB documentation:

Paging Costs

Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.

Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.

You have to ask yourself: how often do you really need the 40,000th page? Also see this article.
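
You can watch the cost grow yourself with explain(); a hypothetical shell session against the collection from the question (exact counts will vary):

// Without skip, only about `limit` documents are examined
db.myCollection.find().limit(5).explain("executionStats").executionStats.totalDocsExamined   // ~5

// With skip, the server still walks past every skipped document
db.myCollection.find().skip(200000).limit(5).explain("executionStats").executionStats.totalDocsExamined   // ~200005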

Tomasz Nurkiewicz
  • 52
    It's an INDEX; it should be instantaneous. The worst part is that if you have 1 million documents and skip 10 million, it still takes over a minute. This is hard to understand; doesn't Mongo know at any time how many documents a collection has? But my main point is: why is it slow to get the nth document? Isn't that why we use databases in the first place? – Martijn Scheffer Mar 13 '18 at 15:14
  • 1
    When you do a naked find (cute huh), it actually doesn't use an index at all (since you have nothing as a query/filter)... so it is doing a complete COLLSCAN, examining each item (doc) along the way. You could try this trick to at least use the default ID index... db.myCollection.find({'_id': {'$gt': ''}}).skip(20000).limit(5) Something like that will at least remove the large doc scan, sticking to an index scan. When using skip/limit, you have to use filters that are completely contained in the index in order to stick to the index scan, else it has to go to the docs to figure things out. – xEverybodyx Jan 21 '21 at 14:31
  • 1
    Instead of using _id for the query, you can also just sort by _id to trigger index use. db.myCollection.find().skip(20000).sort({"_id":1}).limit(5).explain("executionStats") – xEverybodyx Jan 21 '21 at 14:36
  • 1
    @xEverybodyx In my experience, .skip slows down severely even when an index is used. – Adam Moisa Aug 08 '21 at 03:55
11

I found it performant to combine the two concepts (a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.

The steps I take are:

  1. Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
  2. Know the starting value, page size and the page you want to jump to
  3. Project + skip + limit to find the value you should start from
  4. Find + limit the page's results

It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):

let page = 5432;
let page_size = 16;
let skip_size = page * page_size;

let retval = await db.collection(...)
    .find()
    .sort({ "_id": 1 })
    .project({ "_id": 1 })
    .skip(skip_size)
    .limit(1)
    .toArray();
let start_id = retval[0]._id;

retval = await db.collection(...)
    .find({ "_id": { "$gte": new mongo.ObjectID(start_id) } })
    .sort({ "_id": 1 })
    .project(...)
    .limit(page_size)
    .toArray();

This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.

Mr. T
7

I combined two answers.

The problem is that when you use skip and limit without a sort, the documents are paginated in natural order, in the same sequence as they were written to the collection, so the engine first has to build a temporary ordering. It is better to use the ready-made _id index :) Sort by _id and it is very quick even with large collections:

db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });

In PHP it looks like this:

$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => ['_id' => 1],
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
Kamil Dąbrowski
  • 1
    You have literally saved our project! Sorting by date took approx. 25 seconds, while sorting by id took milliseconds. In our case id order and date order happen to coincide. Thank you very much! – Fabio Ricci Sep 12 '22 at 09:11
2

My collection has around 1.3M documents (not that big) and is properly indexed, but it still takes a big performance hit from this issue.

After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to an auto-increment value in SQL, instead of a time-based value.

The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.

On the other hand, assigning a counting number to each document hurts write performance, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't apply to my situation, but this one does. (I need to be able to seek forward to an arbitrary page number, not just one page at a time.)

Deletes are also hard to handle, but still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
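
A minimal shell sketch of the counting-integer idea (the seq field, the counters collection, and all names here are illustrative, not part of the original answer):

// One-time setup: index the counter field
db.items.createIndex({ seq: 1 });

// At insert time, atomically assign the next sequential counter
var seq = db.counters.findOneAndUpdate(
    { _id: "items" },
    { $inc: { value: 1 } },
    { upsert: true, returnNewDocument: true }
).value;
db.items.insertOne({ seq: seq, name: "example" });

// Jumping straight to an arbitrary page is then a pure range query, no skip
// (seq starts at 1, so page 0 covers seq 1..pageSize)
var pageSize = 16, pageNumber = 5432;
db.items.find({ seq: { $gt: pageNumber * pageSize, $lte: (pageNumber + 1) * pageSize } })
        .sort({ seq: 1 });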

I'm just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I'm currently dealing with, but next time I'll build a better one if I encounter a similar situation.

Curious Sam
  • Brilliant solution - I implemented exactly this. Works wonderfully for pagination and jumping to a page. Don't forget to add an index. – Nick Grealy Jun 06 '23 at 16:12
2

I'm going to suggest a more radical approach: combine skip/limit (as an edge case, really) with range-based buckets on your sort key, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort is). So you have top-level pages that each cover a range of time, and sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire collection to reach the final page.
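
A rough sketch of what a time-bucketed page query could look like (bucket boundaries, field, and collection names are illustrative):

// Top-level "page": one time bucket, served straight from the created_date index
var bucketStart = ISODate("2012-06-01T00:00:00Z");
var bucketEnd = ISODate("2012-07-01T00:00:00Z");
db.myCollection.find({ created_date: { $gte: bucketStart, $lt: bucketEnd } })
               .sort({ created_date: 1 });

// Sub-page within the bucket: skip/limit now only walks this small range
db.myCollection.find({ created_date: { $gte: bucketStart, $lt: bucketEnd } })
               .sort({ created_date: 1 })
               .skip(20)
               .limit(20);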

Novaterata
0

If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.

As stated in the official MongoDB docs:

The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.

Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.

Descending order (example):

function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $lt: startValue } } )
             .sort( { _id: -1 } )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
               endValue = student._id;
             } );
  return endValue;
}

An ascending-order version is analogous.
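
As a sketch (mirroring the descending function above; the function name is illustrative), flip the comparison to $gt and the sort to ascending:

function printStudentsAscending(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $gt: startValue } } )
             .sort( { _id: 1 } )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
               endValue = student._id;
             } );
  return endValue;
}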

0

If you know the _id of the last element of the previous page, you can fetch the next page directly:

db.myCollection.find({ _id: { $gt: id } }).limit(5)

This simple solution works like a charm.

Hyzyr
0

For faster pagination, don't use skip(). Use find() and limit(), querying on the last _id of the previous page.

Here is an example where I'm paging over a large number of documents using Spring Boot:

Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // the last _id of the previous page

for (int i = 0; i < (totalElements / pageSize); i++) {
  page += 1;
  Aggregation aggregation = Aggregation.newAggregation(
    Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))), // seek past the previous page
    Aggregation.sort(Sort.Direction.ASC, "_id"),
    new CustomAggregationOperation(queryOffersByProduct), // author's custom pipeline stage
    Aggregation.limit(pageSize)
  );

  List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();

  // Remember the boundary for the next iteration
  lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}