61

I have approximately 1.7M documents in MongoDB (10M+ in the future). Some of them represent duplicate entries, which I do not want. The structure of a document is something like this:

{
    _id: 14124412,
    nodes: [
        12345,
        54321
    ],
    name: "Some beauty"
}

A document is a duplicate if it has at least one node in common with another document that has the same name. What is the fastest way to remove the duplicates?
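For example, these two documents would count as duplicates of each other, because they share the name and the node 54321 (the second _id and its extra node are just made-up examples):

{ _id: 14124412, nodes: [12345, 54321], name: "Some beauty" }
{ _id: 14124413, nodes: [54321, 99999], name: "Some beauty" }  // same name, shares node 54321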

Somnath Muluk
ewooycom

14 Answers

105

The dropDups: true option is not available in 3.0.

I have a solution using the aggregation framework for collecting the duplicates and then removing them in one go.

It might be somewhat slower than system-level "index" changes, but it is good considering the way you want to remove the duplicate documents.

a. Remove all duplicates in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // selection criteria; adjust or remove as needed
  }},
  { $group: { 
    _id: { name: "$name" }, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }},
  { $match: { 
    count: { "$gt": 1 }    // groups with more than one document contain duplicates
  }}
],
{allowDiskUse: true}       // Allows large group stages to spill to disk
)               // You can run the pipeline up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // Keep the first _id; the rest are duplicates
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);   // Collect every duplicate _id
    });
});

// Optional: print all the _id values that are about to be deleted
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})  

b. You can delete documents one by one.

db.collectionName.aggregate([
  // Selection criteria; you can remove the $match stage if you want
  { $match: { 
    "source_references.key": { "$ne": '' }  
  }},
  { $group: { 
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // groups with more than one document contain duplicates
  }}
],
{allowDiskUse: true}       // Allows large group stages to spill to disk
)               // You can run the pipeline up to this point to inspect the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // Keep the first _id; the rest are duplicates
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete the remaining duplicates
})
Kaspar Lee
Somnath Muluk
  • If the result is not used means, i have got TypeError issue – sara Apr 06 '16 at 10:07
  • Thank you for helpful answer ! I did find it better, when you have lot's of rows(i've 5M rows) that it's better to create a counter and limit it for each 10K and not for the entire duplicates cause it may be just too big :) – Mazki516 Jun 17 '16 at 13:59
  • This looks great! Do you have any performance recommendation? I have about 3M rows with few dups. Is it better to do in one go (your solution a), or one by one? – Nico Sep 20 '16 at 13:17
  • For large number of documents, you should delete documents at less traffic time (say midnight if targeted to some timezone). You can take approach `b` delete one by one document. – Somnath Muluk Sep 20 '16 at 13:27
  • Fantastic script, thanks! You can add {allowDiskUse: true} after the ] if your collection is very large and/or you don't have an index on the _id fields. I also had to remove the .result ; in which circumstances would one need it? – Nico Sep 30 '16 at 16:33
  • @NicoDurand: Added disk usage option for faster processing. .result was coming for older version or in Robomongo. Removed it now. – Somnath Muluk Oct 01 '16 at 07:29
  • NOTE: _$addToSet does not guarantee a particular ordering of elements in the modified set_ as per https://docs.mongodb.com/manual/reference/operator/update/addToSet/ ...use `$push` if you'd like to preserve order ...ALSO NOTE: `$addToSet` will not add duplicates to the array whereas `$push` will – Benjamin Hoffman Mar 21 '17 at 18:25
52

Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true}) 

As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.

UPDATE

This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).
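For 3.x and later, a minimal sketch of the replacement workflow (assuming the existing duplicates have already been removed some other way, e.g. with the aggregation approach in the other answer):

// dropDups is gone, so a unique index can only block *new* duplicates;
// clean up the existing ones first, then:
db.test.createIndex({name: 1, nodes: 1}, {unique: true})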

JohnnyHK
  • Name doesn't need to be unique on its own. Will this delete it only if both name and at least one node are same? – ewooycom Jan 06 '13 at 17:08
  • @user1188570 It is compound so both of the fields must have a duplicate in the same document – Sammaye Jan 06 '13 at 17:14
  • @Sammaye I think its better solution to merge nodes, is there anything like action: {$merge: nodes} instead of dropDups? How would you achieve this? – ewooycom Jan 06 '13 at 22:36
  • @user1188570 Hmm that is a complicated task and no easy feat. The only really way I can think of atm is to do it the hard way of pulling it out client side and doing each row one by one atm. You could do a MR and emit to a separate collection and then use that and drop the first collection mind – Sammaye Jan 06 '13 at 22:57
  • You may want to remove the index after you've let it delete duplicates: `db.test.dropIndex({name: 1, nodes: 1})` – Meekohi Feb 11 '14 at 22:44
  • The API has now changed to `db.accounts.createIndex( { "tax-id": 1 }, { unique: true, dropDups: true } )` – Chenna V Mar 10 '16 at 14:51
  • It doesn't eliminate duplicates in 2.6. – Ruben Apr 15 '17 at 15:16
38

  1. Create a collection dump with mongodump
  2. Clear the collection
  3. Add a unique index
  4. Restore the collection with mongorestore (see the sketch below)
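A sketch of those four steps; the database and collection names are placeholders, mongodump/mongorestore run from the system shell, and the rest runs in the mongo shell:

// 1. Dump the collection (system shell):
//      mongodump --db=yourDb --collection=collectionName --out=dump/
// 2. Clear the collection (mongo shell):
db.collectionName.drop()
// 3. Add the unique index (this also recreates the empty collection):
db.collectionName.createIndex({name: 1, nodes: 1}, {unique: true})
// 4. Restore the dump (system shell); documents that violate the unique index
//    are reported as duplicate key errors and skipped:
//      mongorestore --db=yourDb --collection=collectionName dump/yourDb/collectionName.bson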

dhythhsba
  • This was by far the easiest way for me to do this - a couple minutes of downtime to save the stress of having to run an unfamiliar query. – rococo Sep 15 '17 at 20:11
  • Thanks, can I clarify that restoring the collection after adding a unique index means that there will be no error when a duplicate entry is tried? – memebrain Nov 10 '17 at 23:26
  • One of the easiest way to remove duplicate, highly recommended – Qazi Apr 28 '20 at 19:38
18

I found this solution, which works with MongoDB 3.4. I'll assume the field with duplicates is called fieldX:

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.

It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages. Although the documentation for $first mentions a sort stage prior to using $first, it worked for me without one.

You can save the results to a new collection by adding an $out stage...
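For instance, a sketch of the same pipeline writing the surviving documents to a (hypothetical) new collection:

db.collection.aggregate([
  { $match: { "fieldX": { $nin: [null] } } },
  { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
  { $replaceRoot: { "newRoot": "$doc" } },
  { $out: "collection_deduped" }   // hypothetical target collection name
],
{allowDiskUse: true})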

Alternatively, if one is only interested in a few fields (e.g. field1 and field2) rather than the whole document, use the group stage without $replaceRoot:

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})
Ali Abul Hawa
  • how can i this code used for multiple keys (duplications of fieldA and fieldB) ?? – babeyh May 09 '20 at 01:50
  • unless you have most entries are duplicated. This is very inefficient. Think about that you have only 1/10000 duplicates. With your method you have to copy 1TB to get rid of 100MB duplicates. – Wang Jul 09 '21 at 03:18
12

The following Mongo aggregation pipeline does the deduplication and outputs it back to the same or different collection.

collection.aggregate([
  { $group: {
    _id: '$field_to_dedup',
    doc: { $first: '$$ROOT' }
  } },
  { $replaceRoot: {
    newRoot: '$doc'
  } },
  { $out: 'collection' }
], { allowDiskUse: true })
Mihailoff
4

My DB had millions of duplicate records. @somnath's answer did not work as is, so I am writing up the solution that worked for me, for people looking to delete millions of duplicate records.

/** Create a array to store all duplicate records ids*/
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain first element*/
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}

Mayank Patel
2

Here is a slightly more 'manual' way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then perform a search using each of those keys, and delete all but one document whenever that search returns more than one.

  db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({key: num}).forEach((doc) => {
      if (i) db.collection.remove({_id: doc._id});  // keep the first match, delete the rest
      i++;
    });
  });
Fernando
  • OP document is duplicated if it has a least one node same as another document with same name so `.distinct("key")` is not suitable. – Florent Jul 03 '18 at 15:03
2

Tips to speed this up, when only a small portion of your documents are duplicated:

  1. You need an index on the field used to detect duplicates.
  2. $group does not use the index, but it can take advantage of $sort, and $sort does use the index, so put a $sort step at the beginning.
  3. Do an in-place delete_many() instead of $out to a new collection; this saves a lot of IO time and disk space.

If you use pymongo you can do:

import itertools

import pymongo
from pymongo import IndexModel

# col is the pymongo collection being deduplicated, e.g.:
# col = pymongo.MongoClient()["yourDb"]["yourCollection"]

index_uuid = IndexModel(
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
    {"$sort": {"uuid": 1}},
    {
        "$group": {
            "_id": "$uuid",
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1}
        }
    },
    {
        "$match": {"count": {"$gt": 1}}
    },
]
it_cursor = col.aggregate(
    pipeline, allowDiskUse=True
)
# skip the 1st _id of each dups group (that document is kept)
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id": {"$in": dups}})

Performance

I tested it on a database containing 30M documents, about 1TB in size.

  • Without the index/sort it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
  • With the index/sort but using $out to output to a new collection: this is safer if your filesystem does not support snapshots, but it requires a lot of disk space and takes more than 40 minutes to finish, despite the fact that we are using SSDs. It will be much slower on HDD RAID.
  • With the index/sort and an in-place delete_many, it takes around 5 minutes in total.
Wang
1

The following method merges documents with the same name while keeping only the unique nodes, without duplicating them.

I found using the $out operator to be a simple way to do this. I unwind the array and then group by name, adding the nodes to a set. The $out operator allows the aggregation result to persist [docs]. If you pass the name of the collection itself, it will replace the collection with the new data; if the name does not exist, it will create a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.

db.collectionName.aggregate([
  {
    $unwind: { path: "$nodes" }
  },
  {
    $group: {
      _id: "$name",
      nodes: {
        $addToSet: "$nodes"
      }
    }
  },
  {
    $project: {
      _id: 0,
      name: "$_id",
      nodes: 1
    }
  },
  {
    $out: "collectionNameWithoutDuplicates"
  }
])
sanair96
1

Using pymongo, this should work.

Add the fields that need to be unique for the collection in unique_field:

unique_field = {"field1": "$field1", "field2": "$field2"}

cursor = DB.COL.aggregate(
    [
        {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}},
    ],
    allowDiskUse=True,
)

Slice the dups array depending on the duplication count (here I had only one extra duplicate for all of them):

items = list(cursor)
removeIds = items[0]['dups']
DB.COL.remove({"uuid": {"$in": removeIds}})
Saad
Renny
1

I don't know whether this answers the main question, but it will be useful for others.

1. Query the duplicate row using the findOne() method and store it as an object.

const User = db.User.findOne({_id:"duplicateid"});

2. Execute the deleteMany() method to remove all the rows with the id "duplicateid".

db.User.deleteMany({_id:"duplicateid"});

3. Insert the values stored in the User object.

db.User.insertOne(User);

Easy and fast!!!!

1

First, you can find all the duplicates and remove them from the DB. Here we use the id field to check for and remove duplicates.

db.collection.aggregate([
    { "$group": { "_id": "$id", "count": { "$sum": 1 } } },
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
    { "$sort": { "count": -1 } },
    { "$project": { "name": "$_id", "_id": 0 } }
]).toArray().then(data => {
    var dr = data.map(d => d.name);
    console.log("duplicate Records:: ", dr);
    db.collection.remove({ id: { $in: dr } }).then(removedD => {
        console.log("Removed duplicate Data:: ", removedD);
    })
})
0
  1. The general idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/ to retrieve one random id from among the duplicate records in the collection.

  2. Delete all the records in the collection other than the random id that we retrieved with the findOne option.

You can do something like this if you are trying to do it in pymongo.

def _run_query():

    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on record %s", record)

            try:
                # Keep one _id among the documents matching these (placeholder) field values
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])

                # Delete every other matching document
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})

            except Exception as ex:
                _logger.error("Error when retaining the record %s Exception: %s", record, str(ex))

    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])

From the shell:

  1. Replace find_one with findOne
  2. The same remove command should work (a shell sketch follows).
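A rough mongo-shell equivalent of the pymongo snippet above (the field names and values are the same placeholders):

var retain = db.collection.findOne({field1: 'x', field2: 'y'}, {_id: 1});
db.collection.remove({field1: 'x', field2: 'y', _id: {$ne: retain._id}});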
amateur
0

I had to remove 3M duplicate records, and I did the following in the Mongo shell:

  1. use your_DB_name
  2. Run the following loop; each pass keeps one document per group and deletes the rest, up to 100,000 deletions at a time:

for (var i = 1; i <= 30; i++) {
    var data = db.collectionName.aggregate([
        { "$group": { "_id": "$yourGroupById", "count": { "$sum": 1 }, "data": { "$push": "$$ROOT" } } },
        {
            $project: {
                result: { $slice: ["$data", { $subtract: [{ $size: "$data" }, 1] }] },
                count: "$count"
            }
        },
        { "$unwind": "$result" },
        { $limit: 100000 }
    ], { allowDiskUse: true }).toArray()

    data = data.map(r => r.result._id)
    db.collectionName.deleteMany({ _id: { $in: data } })
    data = []
}