4

first of all i'm new to mongo so I don't know much and i cannot just remove duplicate rows due to some dependencies.

I have following data stored in mongo

{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 2, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'},
{'id': 5, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}

you can see some of the rows are duplicate with different id as long as it will take to solve this issue from input I must tackle it on output.

I need the data in the following way:

{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}

My query

keys = db.collection.distinct('key', {})
all_data = db.collection.find({'key': {$in: keys}})

As you can see it takes two queries for a same result set Please combine it to one as the database is very large

I might also create a unique key on the key but the value is so long (152 characters) that it will not help me.

Or it will??

Cybersupernova
  • 1,833
  • 1
  • 20
  • 37
  • Possible duplicate of [Remove Duplicates from MongoDB](http://stackoverflow.com/questions/31557053/remove-duplicates-from-mongodb) – Blakes Seven Mar 08 '16 at 21:14
  • Maybe not "exactly" in the "removal" part, but certainly in the identification part. Basically you need to `$group` on the common field(s) and then return any other ancillary fields with `$first` – Blakes Seven Mar 08 '16 at 21:15

1 Answers1

5

You need to use the aggregation framework for this. There are multiple ways to do this, the solution below uses the $$ROOT variable to get the first document for each group:

db.data.aggregate([{
  "$sort": {
    "_id": 1
  }
}, {
  "$group": {
    "_id": "$key",
    "first": {
      "$first": "$$ROOT"
    }
  }
}, {
  "$project": {
    "_id": 0,
    "id":"$first.id",
    "key":"$first.key",
    "name":"$first.name",
    "country":"$first.country"
  }
}])
Alex
  • 21,273
  • 10
  • 61
  • 73