I want to find objects with duplicate values for a specific field, but only if they are related to the same secondary object.

For example, with the following models:

from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    ...

class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    headline = models.CharField(max_length=255)
    pub_date = models.DateField()
    ...

How do I select all Entry objects that have the same headline, but only if they are published in the same Blog? Headlines shared between blogs should not be included. (Bonus: an extra filter that extracts only duplicate headlines with the same publication date, still within the same blog.)


The closest solution I have is based on this answer: https://stackoverflow.com/a/8989348/20488494

The first part, obtaining duplicate values, is successful:

from django.db.models import Count

dupe_entries_vals = (
    Entry.objects
    .values('blog', 'headline', 'pub_date')
    .order_by()
    .annotate(count=Count('headline'))
    .filter(count__gt=1)
)

But then I can't use the second part, because it loses the related-object filtering. The PKs are also lost by taking values() (necessarily, since otherwise the duplicate filtering fails), so I can't use those either.

# No longer restricted to related blog
dupe_entries = Entry.objects.filter(headline__in=[e['headline'] for e in dupe_entries_vals])

# KeyError (id not in `dupe_entries_vals`)
dupe_entries = Entry.objects.filter(pk__in=[e['id'] for e in dupe_entries_vals])

So is there a nice solution for this? Either Django native or a direct SQL query that can be used for most DB backends (if this can't be constructed with Django, please explain)?

1 Answer

If you're using Postgres, you should be able to do something like this:

from django.contrib.postgres.aggregates import ArrayAgg

qs = Entry.objects.all()
qs = qs.values("pub_date", "headline", "blog_id")  # fields to group by
qs = qs.annotate(ids=ArrayAgg("pk"))  # collect the pks of each group into a list
qs = qs.filter(ids__len__gt=1)  # keep only groups containing duplicates
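
From there you can recover the actual Entry rows by flattening the aggregated ids (a minimal sketch; `dupe_pks` is just an illustrative name). Note that grouping on "pub_date" as well already covers your bonus question; drop it from values() if you only want to group by blog and headline:

from itertools import chain

# Flatten the per-group pk lists into a single list of duplicate pks.
dupe_pks = list(chain.from_iterable(group["ids"] for group in qs))
dupe_entries = Entry.objects.filter(pk__in=dupe_pks)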
Alombaros
  • Thanks for the response. Looks like this works, but it would be nice to have a more robust or generic solution that doesn't rely on Postgres-specific functionality. Is there something in the Django QuerySet construction possibilities, or in the nature of my query, that prevents this? – ADuckTyping Jul 10 '23 at 12:06
  • You can easily remove the need for Postgres by changing the aggregation function to a Count and using the grouping fields as filters (sketched after this thread) – Alombaros Jul 10 '23 at 15:47
  • How do I avoid losing the pk values in that case? I can't include them in the grouping because that makes them all unique and the count annotation no longer works. – ADuckTyping Jul 10 '23 at 21:42
  • You'll inevitably lose the ids. But you have the values of each grouping field to use as filters: `Entry.objects.filter(pub_date=obj.pub_date, headline=obj.headline, blog_id=obj.blog_id)`. What do you intend to do with your duplicates? – Alombaros Jul 11 '23 at 10:50
  • What is `obj` in your filter? My first query gives me a QuerySet of dicts with the specified values plus an additional count entry. I intend to delete the duplicates (after checking some other information). – ADuckTyping Jul 11 '23 at 13:57
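
A backend-agnostic version of the approach from the comments, combined with the deletion use case, might look like the following. This is a sketch under stated assumptions: keeping the lowest-pk entry of each group is an illustrative choice, and the extra checks mentioned above are left out.

from django.db.models import Count

# Group on the fields that define a duplicate; Count works on every backend.
dupe_groups = (
    Entry.objects
    .values('blog', 'headline', 'pub_date')
    .order_by()
    .annotate(count=Count('pk'))
    .filter(count__gt=1)
)

for group in dupe_groups:
    # Re-select the concrete rows of each group, using its values as filters.
    entries = Entry.objects.filter(
        blog=group['blog'],
        headline=group['headline'],
        pub_date=group['pub_date'],
    )
    # Keep one row per group (here: the lowest pk) and delete the rest.
    keep_pk = entries.order_by('pk').first().pk
    entries.exclude(pk=keep_pk).delete()

This issues one extra query per duplicate group, which is usually acceptable for a one-off cleanup; the per-group filter is what restores the same-blog restriction that the plain headline__in approach lost.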