How to delete rows with duplicate columns in django

Question

I have a model

class Foo(models.Model): 
     first = models.CharField()
     second = models.CharField()

data I have is

first second
1     2
1     2
1     2
3     4

Now I want to delete all duplicate rows and keep one entry. The end result

first second
1     2
3     4

How do I do this? I checked this question but could not figure it out properly. Annotate

I have tried

foo_ids = Foo.objects.annotate(first_c=Count('first'), second_c=Count('second')).filter(first_c__gt=1, second_c__gt=1).values('first', 'second', 'id')

Then try and figure out how to not delete one of each I list of duplicates.

Here's a great post answering your question. https://stackoverflow.com/a/5879542/8326247 — Daniel J, Jul 18 '17 at 15:29

score 1 · Accepted Answer · answered Jul 18 '17 at 15:50

1

I ended up doing this.

from django.db.models import Count
duplicate_foo = Foo.objects.values('req_group','child_program__id').annotate(id_c=Count('id')).filter(id_c__gt=1)
for dups in duplicate_foo:
    for i, val in enumerate(Foo.objects.filter(first=dups['first'],                                                      
                                               second=dups['second'])):
        if i ==0:
            continue
        val.delete()

Not the most optimzed solution. But it works.

answered Jul 18 '17 at 15:50

Akamad007

1,551
3
22
38

Speed depends on how many duplicates you have. If most of your records are not duplicated, this is faster than Exprator's answer. – Eduard Luca Nov 25 '19 at 14:00
Calling delete() on every single duplicate still triggers one DB query for each duplicate. That's a massive overhead if you deal with thousands of duplicates. – LukasKlement May 19 '22 at 13:40

score 0 · Answer 2 · answered May 19 '22 at 13:38

It's an older thread, but both answers don't fit the bill for large datasets and lead to a huge number of queries.

You can use this generic method instead:

from django.apps import apps
from django.db.models import Count, Max


def get_duplicates_from_fields(app, model, fields, delete=False, **filters):
    """
    Returns duplicate records based on a list of fields,
    optionally deletes duplicate records.
    """
    Model = apps.get_model(app, model)
    duplicates = (Model.objects.filter(**filters).values(*fields)
                  .order_by()
                  .annotate(_max_id=Max('id'), _count=Count('id'))
                  .filter(_count__gt=1))

    for duplicate in duplicates:
        if delete:
            (
                Model.objects
                .filter(**filters)
                .filter(**{x: duplicate[x] for x in fields})
                .exclude(id=duplicate['_max_id'])
                .delete()
            )
        else:
            print(duplicate)

You can use this method as such:

get_duplicates_from_fields('myapp', 'Foo', ['first', 'second'], True)

This lets you find and delete duplicate records based on any number of fields.

How to delete rows with duplicate columns in django

2 Answers2