Django select only rows with duplicate field values

Question

suppose we have a model in django defined as follows:

class Literal:
    name = models.CharField(...)
    ...

Name field is not unique, and thus can have duplicate values. I need to accomplish the following task: Select all rows from the model that have at least one duplicate value of the name field.

I know how to do it using plain SQL (may be not the best solution):

select * from literal where name IN (
    select name from literal group by name having count((name)) > 1
);

So, is it possible to select this using django ORM? Or better SQL solution?

score 282 · Accepted Answer · edited Jan 22 '17 at 22:27

282

Try:

from django.db.models import Count
Literal.objects.values('name')
               .annotate(Count('id')) 
               .order_by()
               .filter(id__count__gt=1)

This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:

dupes = Literal.objects.values('name')
                       .annotate(Count('id'))
                       .order_by()
                       .filter(id__count__gt=1)
Literal.objects.filter(name__in=[item['name'] for item in dupes])

edited Jan 22 '17 at 22:27

John Mee

50,179
34
152
186

answered Jan 24 '12 at 15:24

Chris Pratt

232,153
36
385
444

8

Probably you have meant ``Literal.objects.values('name').annotate(name_count=Count('name')).filter(name_count__gt=1)``? – dragoon Jan 24 '12 at 15:28
Original query gives ``Cannot resolve keyword 'id_count' into field`` – dragoon Jan 24 '12 at 15:33
Sorry, yes. Had some issues with the code, as I was working from memory. Current version of my answer works. – Chris Pratt Jan 24 '12 at 15:43
3

Thanks for updated answer, I think I will stick with this solution, you can even do it without list comprehension by using ``values_list('name', flat=True)`` – dragoon Jan 24 '12 at 15:45
Right. Nice. I always forget about `flat=True` ;). – Chris Pratt Jan 24 '12 at 15:48
`FieldError: Join on field 'id' not permitted. Did you misspell 'count' for the lookup type?` – wRAR Nov 02 '12 at 17:37
2

Django previously had a bug on this (might have been fixed in recent versions) where if you don't specify a fieldname for the `Count` annotation to saved as, it defaults to `[field]__count`. However, that double-underscore syntax is also how Django interprets you wanting to do a join. So, essentially when you try to filter on that, Django thinks you're trying to do a join with `count` which obviously doesn't exist. The fix is to specify a name for your annotation result, i.e. `annotate(mycount=Count('id'))` and then filter on `mycount` instead. – Chris Pratt Nov 02 '12 at 18:33
1

if you add another call to `values('name')` after your call to annotate, you can remove the list comprehension and say `Literal.objects.filter(name__in=dupes)` which will allow this to all be executed in a single query. – Piper Merriam Mar 01 '13 at 23:58
I tried it. It works great except for duplicates where the values are None or blank string. For those, you'd get wrong numbers. – max Jan 06 '17 at 21:58
Can it be modified to case insensitive ? case insensitive names ? – Sandeep Balagopal Oct 06 '17 at 12:04
This works for me for finding all duplicated items. Is there a way to skip the first duplicated while I want to keep one of them? – Jason Liu Jun 28 '18 at 17:31
order_by is not needed – aivarsk Dec 16 '22 at 10:27

score 61 · Answer 2 · edited Jul 01 '19 at 12:44

61

This was rejected as an edit. So here it is as a better answer

dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)

This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:

Literal.objects.filter(name__in=dups)

The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.

edited Jul 01 '19 at 12:44

Daniel Holmes

1,952
2
17
28

answered Apr 13 '13 at 14:21

Piper Merriam

2,774
2
24
30

Nice trick, unfortunately this will only work if just one value is used (eg. if both 'name' and 'phone' where used, the last part wouldn't work). – gdvalderrama Feb 14 '17 at 17:50
2

What is the `.order_by()` for? – stefanfoulis Mar 24 '17 at 22:01
4

@stefanfoulis It clears out any existing ordering. If you have a model-set ordering, this becomes part of the SQL `GROUP BY` clause, and that breaks things. Found that out when playing with Subquery (in which you do very similar grouping via `.values()`) – Oli Jun 06 '17 at 09:01

score 12 · Answer 3 · answered Jan 24 '12 at 15:26

12

try using aggregation

Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)

answered Jan 24 '12 at 15:26

JamesO

25,178
4
40
42

Ok, that gives the corrent list of names, but is it possible to selects ids and other fields at the same time? – dragoon Jan 24 '12 at 15:35
@dragoon - no but Chris Pratt has covered the alternative in his answer. – JamesO Jan 24 '12 at 15:48

score 7 · Answer 4 · answered Mar 09 '17 at 07:22

7

In case you use PostgreSQL, you can do something like this:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value

duplicate_ids = (Literal.objects.values('name')
                 .annotate(ids=ArrayAgg('id'))
                 .annotate(c=Func('ids', Value(1), function='array_length'))
                 .filter(c__gt=1)
                 .annotate(ids=Func('ids', function='unnest'))
                 .values_list('ids', flat=True))

It results in this rather simple SQL query:

SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1

answered Mar 09 '17 at 07:22

Eugene Pakhomov

9,309
3
27
53

1

I tried this but python code gave me an error: `FieldError: Expression contains mixed types: ArrayField, IntegerField. You must set output_field.`. However, SQL query works as expected (Django 3.2) – oglop Oct 04 '21 at 07:28
Works great (Django 2.2). Also, you don't need the `array_length` annotation, and can instead filter by `ids__len` - https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/#len – a regular fellow Apr 12 '22 at 09:59

Özer · Answer 5 · 2021-03-20T15:36:13.213

2

Ok, so for some reason none of the above worked for, it always returned <MultilingualQuerySet []>. I use the following, much easier to understand but not so elegant solution:

dupes = []
uniques = []

dupes_query = MyModel.objects.values_list('field', flat=True)

for dupe in set(dupes_query):
    if not dupe in uniques:
        uniques.append(dupe)
    else:
        dupes.append(dupe)

print(set(dupes))

edited Mar 20 '21 at 15:36

answered Mar 20 '21 at 15:18

Özer

2,059
18
22

score 0 · Answer 6 · edited Jan 22 '14 at 07:47

0

If you want to result only names list but not objects, you can use the following query

repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat='true')

edited Jan 22 '14 at 07:47

arulmr

8,620
9
54
69

answered Dec 30 '13 at 04:52

user2959723

651
2
7
13

Django select only rows with duplicate field values

6 Answers6

Linked

Related