140

suppose we have a model in django defined as follows:

class Literal:
    name = models.CharField(...)
    ...

Name field is not unique, and thus can have duplicate values. I need to accomplish the following task: Select all rows from the model that have at least one duplicate value of the name field.

I know how to do it using plain SQL (may be not the best solution):

select * from literal where name IN (
    select name from literal group by name having count((name)) > 1
);

So, is it possible to select this using django ORM? Or better SQL solution?

Chris Martin
  • 30,334
  • 10
  • 78
  • 137
dragoon
  • 5,601
  • 5
  • 37
  • 55

6 Answers6

282

Try:

from django.db.models import Count
Literal.objects.values('name')
               .annotate(Count('id')) 
               .order_by()
               .filter(id__count__gt=1)

This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:

dupes = Literal.objects.values('name')
                       .annotate(Count('id'))
                       .order_by()
                       .filter(id__count__gt=1)
Literal.objects.filter(name__in=[item['name'] for item in dupes])
John Mee
  • 50,179
  • 34
  • 152
  • 186
Chris Pratt
  • 232,153
  • 36
  • 385
  • 444
  • 8
    Probably you have meant ``Literal.objects.values('name').annotate(name_count=Count('name')).filter(name_count__gt=1)``? – dragoon Jan 24 '12 at 15:28
  • Original query gives ``Cannot resolve keyword 'id_count' into field`` – dragoon Jan 24 '12 at 15:33
  • Sorry, yes. Had some issues with the code, as I was working from memory. Current version of my answer works. – Chris Pratt Jan 24 '12 at 15:43
  • 3
    Thanks for updated answer, I think I will stick with this solution, you can even do it without list comprehension by using ``values_list('name', flat=True)`` – dragoon Jan 24 '12 at 15:45
  • Right. Nice. I always forget about `flat=True` ;). – Chris Pratt Jan 24 '12 at 15:48
  • `FieldError: Join on field 'id' not permitted. Did you misspell 'count' for the lookup type?` – wRAR Nov 02 '12 at 17:37
  • 2
    Django previously had a bug on this (might have been fixed in recent versions) where if you don't specify a fieldname for the `Count` annotation to saved as, it defaults to `[field]__count`. However, that double-underscore syntax is also how Django interprets you wanting to do a join. So, essentially when you try to filter on that, Django thinks you're trying to do a join with `count` which obviously doesn't exist. The fix is to specify a name for your annotation result, i.e. `annotate(mycount=Count('id'))` and then filter on `mycount` instead. – Chris Pratt Nov 02 '12 at 18:33
  • 1
    if you add another call to `values('name')` after your call to annotate, you can remove the list comprehension and say `Literal.objects.filter(name__in=dupes)` which will allow this to all be executed in a single query. – Piper Merriam Mar 01 '13 at 23:58
  • I tried it. It works great except for duplicates where the values are None or blank string. For those, you'd get wrong numbers. – max Jan 06 '17 at 21:58
  • Can it be modified to case insensitive ? case insensitive names ? – Sandeep Balagopal Oct 06 '17 at 12:04
  • This works for me for finding all duplicated items. Is there a way to skip the first duplicated while I want to keep one of them? – Jason Liu Jun 28 '18 at 17:31
  • order_by is not needed – aivarsk Dec 16 '22 at 10:27
61

This was rejected as an edit. So here it is as a better answer

dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)

This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:

Literal.objects.filter(name__in=dups)

The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.

Daniel Holmes
  • 1,952
  • 2
  • 17
  • 28
Piper Merriam
  • 2,774
  • 2
  • 24
  • 30
  • Nice trick, unfortunately this will only work if just one value is used (eg. if both 'name' and 'phone' where used, the last part wouldn't work). – gdvalderrama Feb 14 '17 at 17:50
  • 2
    What is the `.order_by()` for? – stefanfoulis Mar 24 '17 at 22:01
  • 4
    @stefanfoulis It clears out any existing ordering. If you have a model-set ordering, this becomes part of the SQL `GROUP BY` clause, and that breaks things. Found that out when playing with Subquery (in which you do very similar grouping via `.values()`) – Oli Jun 06 '17 at 09:01
12

try using aggregation

Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)
JamesO
  • 25,178
  • 4
  • 40
  • 42
7

In case you use PostgreSQL, you can do something like this:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value

duplicate_ids = (Literal.objects.values('name')
                 .annotate(ids=ArrayAgg('id'))
                 .annotate(c=Func('ids', Value(1), function='array_length'))
                 .filter(c__gt=1)
                 .annotate(ids=Func('ids', function='unnest'))
                 .values_list('ids', flat=True))

It results in this rather simple SQL query:

SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1
Eugene Pakhomov
  • 9,309
  • 3
  • 27
  • 53
  • 1
    I tried this but python code gave me an error: `FieldError: Expression contains mixed types: ArrayField, IntegerField. You must set output_field.`. However, SQL query works as expected (Django 3.2) – oglop Oct 04 '21 at 07:28
  • Works great (Django 2.2). Also, you don't need the `array_length` annotation, and can instead filter by `ids__len` - https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/#len – a regular fellow Apr 12 '22 at 09:59
2

Ok, so for some reason none of the above worked for, it always returned <MultilingualQuerySet []>. I use the following, much easier to understand but not so elegant solution:

dupes = []
uniques = []

dupes_query = MyModel.objects.values_list('field', flat=True)

for dupe in set(dupes_query):
    if not dupe in uniques:
        uniques.append(dupe)
    else:
        dupes.append(dupe)

print(set(dupes))
Özer
  • 2,059
  • 18
  • 22
0

If you want to result only names list but not objects, you can use the following query

repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat='true')
arulmr
  • 8,620
  • 9
  • 54
  • 69
user2959723
  • 651
  • 2
  • 7
  • 13