
So I would like to de-dup my dataset, which has 2 billion records. I have an index on url, and I want to iterate through each record and see if it's a duplicate.

The index is 110 GB.

MongoDB.Driver.MongoCommandException: 'Command find failed: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit..'

My current method won't run because the index is so large.

var filter = Builders<Page>.Filter.Empty;
var sort = Builders<Page>.Sort.Ascending("url");

await collection.Find(filter)
    .Sort(sort)
    .ForEachAsync(async document =>
    {
        Console.WriteLine(document.Url);
        //_ = await collection.DeleteOneAsync(a => a.Id == document.Id);
    });
Burf2000
  • If there is already an index on `{url:1}`, it should be using the index instead of an in-memory sort. You might try database profiling to see what query is actually being sent to the server. – Joe May 18 '20 at 23:59
  • Are you saying it should use Url instead of the default ID one? – Burf2000 May 19 '20 at 07:30
  • The 32 MB memory limit for sort doesn't apply when an index is used for sorting. – Joe May 19 '20 at 08:51
  • So how do I do that so I can iterate through the documents – Burf2000 May 19 '20 at 20:26
  • Try using [explain](https://stackoverflow.com/questions/13254784/is-there-an-explain-query-for-mongodb-linq) with the all plans execution option, or perhaps database profiling, to see why it isn't using the `{url:1}` index. – Joe May 20 '20 at 08:18
  • So are you saying when you add an index to a mongo table it then uses that first index by default and not the ID? – Burf2000 May 20 '20 at 14:14
  • The query planner will evaluate all of the indexes that are relevant to the query to determine which one performs best. `explain` lets you see what the planner chose, and if you use the allPlansExecution option, you can see comparative performance of the various choices. – Joe May 20 '20 at 18:53
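As the comment thread notes, the 32 MB limit only applies to in-memory sorts; a sort that walks the `{url:1}` index is exempt. If `explain` shows the planner isn't choosing that index, you can force it with a hint. A minimal sketch of the question's loop with a hint, assuming a .NET driver version where `FindOptions` exposes `Hint`:

using MongoDB.Bson;
using MongoDB.Driver;

var filter = Builders<Page>.Filter.Empty;
var sort = Builders<Page>.Sort.Ascending("url");

// The hint forces the planner to walk the {url:1} index, so documents
// stream back already ordered and the server never attempts the
// in-memory sort that triggered the 32 MB error.
var options = new FindOptions { Hint = new BsonDocument("url", 1) };

await collection.Find(filter, options)
    .Sort(sort)
    .ForEachAsync(document => Console.WriteLine(document.Url));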

2 Answers


If the goal is to delete duplicate pages with the same url, why not use an aggregation like the following:

db.Page.aggregate(
    [
        {
            $sort: {
                url: 1
            }
        },
        {
            $group: {
                _id: "$url",
                doc: { $first: "$$ROOT" }
            }
        },
        {
            $replaceWith: "$doc"
        },
        {
            $out: "UniquePages"
        }
    ],
    {
        allowDiskUse: true
    })

It will create a new collection called UniquePages. After inspecting that collection to confirm the data is correct, you can simply drop the old Page collection and rename the new one to Page.
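If you'd rather stay in C#, the same pipeline can be sent through the .NET driver as raw BsonDocument stages. A rough sketch under the question's setup (`collection` is the question's IMongoCollection<Page> and `database` is assumed to be its IMongoDatabase; untested at this scale). Note that $replaceWith requires MongoDB 4.2+; on older servers the equivalent stage is { $replaceRoot: { newRoot: "$doc" } }.

using MongoDB.Bson;
using MongoDB.Driver;

// Mirror the shell pipeline stage-for-stage as raw BsonDocuments.
var pipeline = new[]
{
    new BsonDocument("$sort", new BsonDocument("url", 1)),
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$url" },
        { "doc", new BsonDocument("$first", "$$ROOT") }
    }),
    new BsonDocument("$replaceWith", "$doc"),
    new BsonDocument("$out", "UniquePages")
};

// allowDiskUse lets $sort/$group spill to disk instead of failing once a
// stage exceeds the server's in-memory limit.
await collection.AggregateAsync<BsonDocument>(
    pipeline,
    new AggregateOptions { AllowDiskUse = true });

// After verifying UniquePages looks right, swap the collections
// (assumes `database` holds both collections).
await database.DropCollectionAsync("Page");
await database.RenameCollectionAsync("UniquePages", "Page");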

Dĵ ΝιΓΞΗΛψΚ