I have a database with several collections (~15 million documents overall), and the documents look like this (simplified):
{'Text': 'blabla', 'ID': 101}
{'Text': 'Whuppppyyy', 'ID': 102}
{'Text': 'Abrakadabraaa', 'ID': 103}
{'Text': 'olalalaal', 'ID': 104}
{'Text': 'test1234545', 'ID': 104}
{'Text': 'whapwhapwhap', 'ID': 104}
They all have a unique _id field as well, but I want to delete duplicates according to another field (the external ID field).
First, I tried a very manual approach, collecting IDs in lists and deleting afterwards, but with a DB this big it takes far too long and is not practical.
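Roughly, the logic of that manual approach, sketched here on a tiny in-memory sample rather than the real collection (the sample documents and names are illustrative; the real version used coll.find() and a delete on the collected _ids):

```python
# Sketch of the manual approach: walk the documents once, keep the
# first document seen for each external ID, collect the _ids of the rest.
docs = [
    {'_id': 1, 'Text': 'blabla', 'ID': 101},
    {'_id': 2, 'Text': 'olalalaal', 'ID': 104},
    {'_id': 3, 'Text': 'test1234545', 'ID': 104},
    {'_id': 4, 'Text': 'whapwhapwhap', 'ID': 104},
]

seen = set()       # external IDs already kept
to_delete = []     # _ids of duplicate documents
for doc in docs:
    if doc['ID'] in seen:
        to_delete.append(doc['_id'])
    else:
        seen.add(doc['ID'])

# In the real code: coll.delete_many({'_id': {'$in': to_delete}})
print(to_delete)
```

This works, but building the set and the delete list in Python means pulling every document over the wire, which is what made it impractically slow at 15M documents.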
Second, the following no longer works in current MongoDB versions (the dropDups option was removed in MongoDB 3.0), even though it is still suggested everywhere:
db.collection.ensureIndex( { ID: 1 }, { unique: true, dropDups: true } )
So now I'm trying to build a map-reduce solution, but I don't really know what I'm doing, and in particular I have difficulty using another field (not the database _id) to find and delete the duplicates. Here is my bad first attempt (adapted from some internet source):
from bson.code import Code

# Emit the external ID as the key, so the reduce step counts occurrences.
map_func = Code("function() { if (this.ID) { emit(this.ID, 1); } }")
reduce_func = Code("function(key, values) { return Array.sum(values); }")
res = coll.map_reduce(map_func, reduce_func, "my_results")

duplicate_ids = []
for doc in res.find({'value': {'$gt': 1}}):
    count = int(doc['value']) - 1   # leave one document per external ID
    # the emitted key (the external ID) ends up in doc['_id']
    for dup in coll.find({'ID': doc['_id']}, {'_id': 1}).limit(count):
        duplicate_ids.append(dup['_id'])
coll.delete_many({'_id': {'$in': duplicate_ids}})
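I also wondered whether an aggregation $group would be a better fit than map-reduce here. This is only a sketch of the pipeline I have in mind (untested against the real data); the real call would be coll.aggregate(pipeline, allowDiskUse=True), and below I emulate the grouping on a tiny in-memory sample just to show the idea:

```python
# Pipeline idea: group by the external ID, collect all Mongo _ids per
# group, and keep only groups that actually contain duplicates.
pipeline = [
    {'$group': {'_id': '$ID', 'dups': {'$push': '$_id'}, 'count': {'$sum': 1}}},
    {'$match': {'count': {'$gt': 1}}},
]

# Emulation of the grouping stage on a small sample:
sample = [
    {'_id': 1, 'ID': 101},
    {'_id': 2, 'ID': 104},
    {'_id': 3, 'ID': 104},
]

groups = {}
for d in sample:
    groups.setdefault(d['ID'], []).append(d['_id'])

# _ids to remove: everything but the first _id in each group,
# which is what I would feed into coll.delete_many per group.
dup_ids = [mongo_id for ids in groups.values() for mongo_id in ids[1:]]
print(dup_ids)
```

I don't know whether this scales better than map-reduce on 15M documents, which is part of what I'm asking.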
Any help removing the duplicates in the external ID field (leaving one entry per ID) would be very much appreciated ;) Thanks!