
How should I remove duplicates from a MongoDB collection when there is no unique field?

I want to do this using the Java driver. In the picture below, some records are the same. I want to remove those records. Time is not a unique key here.

[screenshot: sample records shown as a table; several rows are identical]

P.S.: I just presented the data in table form; it is actually in JSON array form.

Dikesh Gandhi
Hitesh Vaghani
  • It's not the key that dictates whether two records are identical, it's rather the contents of the other fields. – Stultuske Apr 07 '15 at 11:19
  • I know that. How should I remove duplicates? – Hitesh Vaghani Apr 07 '15 at 11:25
  • You should write code that checks for duplicates before writing to the database. – Stultuske Apr 07 '15 at 11:27
  • So you mean that there is no solution for this in MongoDB? – Hitesh Vaghani Apr 07 '15 at 11:28
  • I'm saying no such thing. I'm merely saying it's better to "prevent" than to "solve" problems. I'm not familiar with mongodb myself, but most likely, deleting records is possible. But you should question yourself: do you want to go over the tables every evening, manually checking whether there are duplicate records and afterwards manually removing them? – Stultuske Apr 07 '15 at 11:31
  • @Stefan: See the first two rows, they are the same. I want to remove them. I have not put any unique constraints on my collection. There are going to be millions of records and there will be many duplicates among them, so how do I remove them? – Hitesh Vaghani Apr 07 '15 at 11:38
  • @HiteshVaghani: I strongly second stultuske's point of view. However, show us what you tried so far. – Markus W Mahlberg Apr 07 '15 at 11:53
  • I tried dozens of solutions I came across. @Stultuske's point is right, but I have millions of records and checking them while adding to the database will be very costly performance-wise. I mean, that's why we use a database, to speed up performance. – Hitesh Vaghani Apr 07 '15 at 12:16

2 Answers


I think you have 2 options here:

  1. Parse your JSON array into a List, sort it based on the timestamp, compare the entries in your list, and remove items with a duplicate timestamp (and IP address?). This is also possible with a HashSet: if you use an appropriate key, you won't have to do any sorting or comparing yourself, since the HashSet won't add an object whose key is already present (see the sketch after this list).
  2. If you have any control over the source of that JSON array, make sure that it doesn't output the same event in the same second twice. Or even better, provide a more accurate timestamp that includes milliseconds. I don't know what those events mean, but maybe it's possible that 2 (or more) of those events are raised from 1 device within 1 second. By removing duplicate items in your JSON array, you can't know that this has happened. This completely depends on the requirements of your software though.
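A minimal sketch of option 1, assuming the array has already been parsed into a List of `org.bson.Document` objects and that `time` and `ip` are the fields that make two records duplicates (both field names are assumptions, substitute your own):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.bson.Document;

public class DedupeInMemory {

    // Keeps the first occurrence of each (time, ip) combination and drops the rest.
    static List<Document> removeDuplicates(List<Document> docs) {
        Set<String> seen = new HashSet<>();
        List<Document> unique = new ArrayList<>();
        for (Document doc : docs) {
            // Composite key built from the fields that define "sameness" (assumed here).
            String key = doc.get("time") + "|" + doc.get("ip");
            if (seen.add(key)) {   // Set.add returns false when the key is already present
                unique.add(doc);
            }
        }
        return unique;
    }
}
```

The same idea works with a `HashMap<String, Document>` keyed on the composite string if you also need to look the surviving record up by that key later.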
Stefan
  • Stefan: there is no key, and whether a line is identical is not based on the info in a key, but on all the fields. – Stultuske Apr 07 '15 at 11:55
  • I know, I would say that the OP would have to create his own unique key so he can filter duplicate values using that key. – Stefan Apr 07 '15 at 11:57
  • Seems to me it would be better to use a list or set which doesn't take duplicate objects. Writing an added key for it might be quite "heavy" on memory, since it must contain all the information, and, if not done deep enough, might be error prone. – Stultuske Apr 07 '15 at 11:59
  • `Seems to me it would be better to use a list or set which doesn't take duplicate objects.` That is what I suggested. Using a `Map` or `Set` is another way to go which can be tested by the OP. I do agree that it may be prone to errors. – Stefan Apr 07 '15 at 12:05
  • Yes, but what I understand from your explanation is, he should create a key based on everything. I think it's better to have the equals method to verify whether there's already a double of the toAdd object in it. – Stultuske Apr 07 '15 at 12:06
  • Do a `.equals()` comparison on what exactly? The object's instance? I doubt that will work. – Stefan Apr 07 '15 at 12:14
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/74635/discussion-between-stefan-and-stultuske). – Stefan Apr 07 '15 at 12:15

I agree with other users here who have pointed out that the presence of duplicate documents might indicate some problem with your application, and that eliminating duplicates before they are inserted is better than trying to clean them up later. You should ensure that the duplicates truly are meaningless and try to identify their source, as a higher priority than cleaning them up.

That said, the meaning of "duplicate" here seems to be "the value of every single field (except _id) is the same". So, to eliminate duplicates, I would do the following:

1. Iterate over every document in the collection, possibly in parallel using a parallel collection scan.

2. Compute a hash of all of the non-_id fields.

3. Insert a document into another collection representing a set of duplicates:

{
    "_id" : #hash#,
    "docs" : [#array of _ids of docs],
    "count" : #number of _ids in docs array#
}

Then you'll have a record of all duplicates, and you can iterate over this collection and, for each document with count > 1, remove all but one of the duplicates. Alternatively, if you don't want to bother keeping a record of the duplicates, you can insert a doc with the hash as _id and, whenever there's a hash collision, delete the current document because it's a duplicate (with high probability).
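A rough sketch of this approach with the MongoDB Java driver; the connection string, database name (`test`), and collection name (`events`) are placeholders. It groups documents in memory by their non-_id content, which only works while the map fits in RAM; for millions of documents you would persist the groups into the duplicates collection described above instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class RemoveDuplicates {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("test").getCollection("events");

            // Steps 1 and 2: scan the collection and build a key from every field except _id.
            Map<String, List<Object>> idsByContent = new HashMap<>();
            for (Document doc : events.find()) {
                Object id = doc.remove("_id");  // exclude _id from the comparison
                // Assumes duplicates were written with the same field order; this string
                // (or a hash of it) is what you would store as _id in the side collection.
                String key = doc.toJson();
                idsByContent.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
            }

            // Step 3 (simplified): for every group with more than one _id,
            // keep the first document and delete the rest.
            for (List<Object> ids : idsByContent.values()) {
                if (ids.size() > 1) {
                    events.deleteMany(Filters.in("_id", ids.subList(1, ids.size())));
                }
            }
        }
    }
}
```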

wdberkeley
  • While I completely agree with your approach (adding an ID makes things much, much easier), it implies that the OP has control over the software that generates the JSON feed. If that is the case, it is easier (and better) to filter out duplicates before the feed is generated. If it is really about millions of records, you don't want to go through that whole array to check for duplicates, as it could take quite some time and resources. – Stefan Apr 07 '15 at 18:56