13

I have a model with three fields

class MyModel(models.Model):
    a    = models.ForeignKey(A)
    b    = models.ForeignKey(B)
    c    = models.ForeignKey(C)

I want to enforce a unique constraint between these fields, and found django's unique_together, which seems to be the solution. However, I already have an existing database, and there are many duplicates. I know that since unique_together works at the database level, I need to unique-ify the rows, and then try a migration.

Is there a good way to go about removing duplicates (where a duplicate has the same (A,B,C)) so that I can run migration to get the unique_together contstraint?

jkeesh
  • 3,289
  • 3
  • 29
  • 42

2 Answers2

29

If you are happy to choose one of the duplicates arbitrarily, I think the following might do the trick. Perhaps not the most efficient but simple enough and I guess you only need to run this once. Please verify this all works yourself on some test data in case I've done something silly, since you are about to delete a bunch of data.

First we find groups of objects which form duplicates. For each group, (arbitrarily) pick a "master" that we are going to keep. Our chosen method is to pick the one with lowest pk

from django.db.models import Min, Count

master_pks = MyModel.objects.values('A', 'B', 'C'
    ).annotate(Min('pk'), count=Count('pk')
    ).filter(count__gt=1
    ).values_list('pk__min', flat=True)

we then loop over each master, and delete all its duplicates

masters = MyModel.objects.in_bulk( list(master_pks) )

for master in masters.values():
    MyModel.objects.filter(a=master.a, b=master.b, c=master.c
        ).exclude(pk=master.pk).del_ACCIDENT_PREVENTION_ete()
chris Frisina
  • 19,086
  • 22
  • 87
  • 167
second
  • 28,029
  • 7
  • 75
  • 76
  • 1
    Can we do something similar in migration file itself which would avoid having to run extra script? – chhantyal Oct 23 '13 at 07:36
  • 3
    What is `.del_ACCIDENT_PREVENTION_elte()`? – Dusty May 27 '15 at 15:09
  • 17
    `delete` with the phrase `ACCIDENT_PREVENTION` added in the middle to prevent people accidentally deleting stuff by copy/pasting the code without reading it – second May 27 '15 at 21:45
  • @chhantyal Just put it in a migration file. Look at the Django docs for data migrations. I just did one, simple as can be. – Aaron McMillin Jun 28 '17 at 18:24
0

I want to add a slightly improved answer that will delete everything in a single query, instead of looping and deleting for each duplicate group. This will be much faster if you have a lot of records.

non_dupe_pks = list(
    Model.objects.values('A', 'B', 'C')
    .annotate(Min('pk'), count=Count('pk'))
    .order_by()
    .values_list('pk__min', flat=True)
)

dupes = Model.objects.exclude(pk__in=non_dupe_pks)
dupes.delete()

It's important to add order_by() in the first query otherwise the default ordering in the model might mess up with the aggregation.

You can comment out the last line and use dupes.count() to check if the query is working as expected.

Misao
  • 11
  • 1