10

My data looks like this:

{

    "foo_list": [
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
        "name": "Foo 3",
        "slug": "foo-3"
      }
    ]
}

Where foo_list is a field in a model called Bar. Notice that the first and second objects in the array are complete duplicates.

Aside from the obvious solution of switching to PostgresSQL, what MongoDB query can I run to remove duplicate entries from foo_list?

Similar answers that do not quite cut it:

These questions answer the question if the array had bare strings in it. However in my situation the array is filled with objects.

I hope it is clear that I am not interested querying the database; I want the duplicates to be gone from the database forever.

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
andrewrk
  • 30,272
  • 27
  • 92
  • 113

1 Answers1

16

Purely from an aggregation framework point of view there are a few approaches to this.

You can either just apply $setUnion in modern releases:

 db.collection.aggregate([
     { "$project": { 
         "foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
     }}
 ])

Or more traditionally with $unwind and $addToSet:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": "$_id",
        "foo_list": { "$addToSet": "$foo_list" }
    }}
])

Or if you were just interested in the duplicates only then by general grouping:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])    

The last form could be useful to you if you actually want to "remove" the duplicates from your data with another update statement as it identifies the elements which are duplicates.

So in that last form the returned result from your sample data identifies the duplicate:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            }
    ]
}

Where results are returned from your collection per document that contains duplicate entries in the array and which entries are duplicated. This is the information you need to update, and you loop the results as you need to specify the update information from the results in order to remove duplicates.

This is actually done with two update statements per document, as a simple $pull operation would remove "both" items, which is not what you want:

var cursor = db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])    

var batch = db.collection.initializeOrderedBulkOp();
var count = 0;

cursor.forEach(function(doc) {
    doc.foo_list.forEach(function(dup) {
        batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
            "$unset": { "foo_list.$": "" }
        });
        batch.find({ "_id": doc._id }).updateOne({ 
            "$pull": { "foo_list": null }
        });
    });
    
    count++;
    if ( count % 500 == 0 ) {
        batch.execute();
        batch = db.collection.initializeOrderedBulkOp();
    }
});

if ( count % 500 != 0 ) {
    batch.execute();
}

That's the modern MongoDB 2.6 and above way to do it, with a cursor result from aggregation and Bulk operations for updates. But the principles remain the same:

  1. Identify the duplicates in documents

  2. Loop the results to issue the updates to the affected documents

  3. Use $unset with the positional $ operator to set the "first" matched array element to null

  4. Use $pull to remove the null entry from the array

So after processing the above operations your sample now looks like this:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            },
            {
                    "id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d",
                    "name" : "Foo 3",
                    "slug" : "foo-3"
            }
    ]
}

The duplicate is removed with the "duplicated" item still intact. That is how you process to identify and remove the duplicate data from your collection.

Damian
  • 68
  • 1
  • 8
Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
  • The second code listing is identical to both of the answers that I gave as examples of answers that do not work, because this is an array of objects, not an array of strings. – andrewrk Aug 21 '14 at 03:27
  • 1
    @andrewrk MongoDB does not care. It just treats them as "things". Of course the code as shown is tested and works. I always ( unless running off somewhere, but not the case here ) test before submitting a response. – Neil Lunn Aug 21 '14 at 03:29
  • @andrewrk Perhaps if you think this does not work or have otherwise tried where it does not, then your objects are not in fact real duplicates as you have presented them. If only one thing such as "id" is "duplicated" then the data is not a real "set" and you have to handle the "de-duplicatation" differently. But this is not how you have presented your question. – Neil Lunn Aug 21 '14 at 03:37
  • It looks like you're saying that the second listing does not actually modify the database but just returns query results without duplicates. Strange, because the other questions I linked claim that the second code listing will actually save the removals to the database. I will try the third code listing and see what happens. – andrewrk Aug 21 '14 at 15:47
  • The third code listing is the closest thing to a helpful answer, because my goal is not to do a query but to actually modify the database and make the duplicates go away forever. I tried it and I get no results from the query. And I do not mean that the query does not remove duplicates; I mean that it literally has no results returned. – andrewrk Aug 21 '14 at 16:24
  • All of Neil's answers are good ways to find the duplicates that look correct to me, and I trust that he has tested them. None of them change the actual documents in the database- that's why he is referring to using the last pipeline to construct updates to the documents to remove duplicates. There's no one step process to remove the duplicates from the documents but Neil's answers are very helpful as one step of a process to remove the duplicates. – wdberkeley Aug 21 '14 at 19:59
  • @andrewrk At no stage did your question or your comments clearly indicate that your intention was to update data, yet somehow I had the foresight to mention that the last approach was needed to identify duplicates in order to process updates. Of course everything works and if it doesn't for you then you are doing something different. Now there is a full sample of how to identify duplicates and process those results, with the results obtained from "your own sample data in your question". BTW Your only upvote on this to date is from me, for at least researching if not understanding. – Neil Lunn Aug 22 '14 at 00:25
  • I'm looking into trying this out with an updated MongoDB server. I have been testing with 2.4.9 and not getting the results that you are talking about. I'm going to try with 2.6.4 and see if I get the expected results. – andrewrk Aug 22 '14 at 04:39
  • 2.6.4 changes nothing. I don't think the count/match part is working. When I remove the `{$match: { count: { $ne: 1 } } },` part, I get all 3 objects as expected. When I add it back in, I get 0. – andrewrk Aug 22 '14 at 18:23
  • I never got this to work. Instead, I changed my code to carefully avoid creating duplicates. – andrewrk Aug 22 '14 at 22:53
  • @andrewrk You are doing something wrong. Everything above is cut and paste on the data you supply yourself. Either just as an aggregation sample or as actually removing from the collection in an update the results appear as shown. I tried to explain to you reasons why it may not work on other data. Your question therefore does not explain what you are trying to do elsewhere. Cut and paste your own example. I'll even "fix" the data for so it is possible to do so. The information will at least be useful to others. – Neil Lunn Aug 22 '14 at 23:33
  • @NeilLunn can u please look in my question i need help https://stackoverflow.com/questions/66570381/mongoose-delete-nested-array-same-key-object – Mohammad Sadman Mar 11 '21 at 08:22