
So I would like to de-dup my dataset, which has 2 billion records. I have an index on url, and I want to iterate through each record and see if it's a duplicate.

The index is 110 GB.

MongoDB.Driver.MongoCommandException: 'Command find failed: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit..'

My current method won't run because the index is so large.

var filter = Builders<Page>.Filter.Empty;
var sort = Builders<Page>.Sort.Ascending("url");

await collection.Find(filter)
    .Sort(sort)
    .ForEachAsync(async document =>
    {
        Console.WriteLine(document.Url);
        //_ = await collection.DeleteOneAsync(a => a.Id == document.Id);
    });
Burf2000
  • If there is already an index on `{url:1}`, it should be using the index instead of an in-memory sort. You might try database profiling to see what query is actually being sent to the server. – Joe May 18 '20 at 23:59
  • Are you saying it should use Url instead of the default ID one? – Burf2000 May 19 '20 at 07:30
  • The 32 MB memory limit for sort doesn't apply when an index is used for sorting. – Joe May 19 '20 at 08:51
  • So how do I do that so I can iterate through the documents – Burf2000 May 19 '20 at 20:26
  • Try using [explain](https://stackoverflow.com/questions/13254784/is-there-an-explain-query-for-mongodb-linq) with the all plans execution option, or perhaps database profiling, to see why it isn't using the `{url:1}` index. – Joe May 20 '20 at 08:18
  • So are you saying when you add an index to a mongo table it then uses that first index by default and not the ID? – Burf2000 May 20 '20 at 14:14
  • The query planner will evaluate all of the indexes that are relevant to the query to determine which one performs best. `explain` lets you see what the planner chose, and if you use the allPlansExecution option, you can see comparative performance of the various choices. – Joe May 20 '20 at 18:53
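As the comment thread notes, the 32 MB limit only applies to in-memory sorts; a sort that walks the `{url:1}` index is exempt. If `explain` shows the planner isn't choosing that index, you can force it with a hint. A minimal sketch of the question's loop with a hint, assuming a .NET driver version where `FindOptions` exposes `Hint`:

using MongoDB.Bson;
using MongoDB.Driver;

var filter = Builders<Page>.Filter.Empty;
var sort = Builders<Page>.Sort.Ascending("url");

// The hint forces the planner to walk the {url:1} index, so documents
// stream back already ordered and the server never attempts the
// in-memory sort that triggered the 32 MB error.
var options = new FindOptions { Hint = new BsonDocument("url", 1) };

await collection.Find(filter, options)
    .Sort(sort)
    .ForEachAsync(document => Console.WriteLine(document.Url));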

2 Answers


If the goal is to delete duplicate pages with the same url, why not use an aggregation like the following:

db.Page.aggregate(
    [
        {
            $sort: {
                url: 1
            }
        },
        {
            $group: {
                _id: "$url",
                doc: { $first: "$$ROOT" }
            }
        },
        {
            $replaceWith: "$doc"
        },
        {
            $out: "UniquePages"
        }
    ],
    {
        allowDiskUse: true
    })

It will create a new collection called UniquePages. After inspecting that collection to confirm the data is correct, you can simply drop the old Page collection and rename the new one to Page.
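If you'd rather stay in C#, the same pipeline can be sent through the .NET driver as raw BsonDocument stages. A rough sketch under the question's setup (`collection` is the question's IMongoCollection<Page> and `database` is assumed to be its IMongoDatabase; untested at this scale). Note that $replaceWith requires MongoDB 4.2+; on older servers the equivalent stage is { $replaceRoot: { newRoot: "$doc" } }.

using MongoDB.Bson;
using MongoDB.Driver;

// Mirror the shell pipeline stage-for-stage as raw BsonDocuments.
var pipeline = new[]
{
    new BsonDocument("$sort", new BsonDocument("url", 1)),
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$url" },
        { "doc", new BsonDocument("$first", "$$ROOT") }
    }),
    new BsonDocument("$replaceWith", "$doc"),
    new BsonDocument("$out", "UniquePages")
};

// allowDiskUse lets $sort/$group spill to disk instead of failing once a
// stage exceeds the server's in-memory limit.
await collection.AggregateAsync<BsonDocument>(
    pipeline,
    new AggregateOptions { AllowDiskUse = true });

// After verifying UniquePages looks right, swap the collections
// (assumes `database` holds both collections).
await database.DropCollectionAsync("Page");
await database.RenameCollectionAsync("UniquePages", "Page");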

Dĵ ΝιΓΞΗΛψΚ