1

I have Article model that contain a keywords field which is an ArrayField of a list of keywords in sorted ascending order. I want to do a query that finds all articles that have a minimum amount of keywords overlapping.

Example:

article_a = Article(keywords=["tag1", "tag2", "tag3"]
article_b = Article(keywords=["tag1", "tag2", ]
article_c = Article(keywords=["tag1", ]

article_a.find_similar_articles(min_overlap=2)
  # Returns [article_b, ] since it overlaps with at least 2 elements.

There is a similar question here that is for general Postgres, not for the Django ORM.

Anyone know how I can query the Array Field in this way? Or perhaps you have a suggestion of another way to achieve the same result by structuring the data in another manner?

Marcus Lind
  • 10,374
  • 7
  • 58
  • 112
  • I worked on this for a bit, but couldn't quite figure it out yet. Here are a few notes if someone wants to follow along. The `intarray` extension is a dead-end since it's designed only for integers. You can do this in Postgres using `UNNEST` and `INTERSECT` -- see https://stackoverflow.com/questions/756871/postgres-function-to-return-the-intersection-of-2-arrays#5834576 -- which in theory could translated into the Django ORM, but I've not yet been able to make that work. https://gist.github.com/jacobian/4b65b3fbb7103829b941e23c506c822a is what I came up with so far, that doesn't quite work. – jacobian Nov 04 '18 at 14:31
  • Thank you for putting in the effort to look into this. Do you think that perhaps ArrayField is the wrong field type for this kind of lookups? Perhaps Keywords should be a ManyToOne relationship to Article, and then there's some way to do normal Django aggregation to count the overlapping keywords. Thoughts? – Marcus Lind Nov 06 '18 at 14:54
  • I was sorta assuming this was a simplified example from something more complex. If you're literally just doing tagging, you might want to look into [django-taggit](https://django-taggit.readthedocs.io/en/latest/), which has a [similar_objects](https://django-taggit.readthedocs.io/en/latest/api.html#TaggableManager.similar_objects) function built-in that does I think exactly what you want. (Taggit does seem to use many-to-one relations, fwiw.) – jacobian Nov 07 '18 at 00:15
  • 3
    I have just run into the same issue, but I want to count the number of overlaps for sorting purposes rather than filter-- more matches => higher up in the list of results – Maus Aug 02 '19 at 20:33
  • Anyone figured out a solution to this yet? My problem is exactly the same as @Maus. – duplxey Jul 20 '22 at 20:21
  • @duplxey I have posted a working solution to address this issue please let me know if that solves your problem. – Charlesthk Apr 18 '23 at 06:19

1 Answers1

0

Given your model :

from django.contrib.postgres.fields import ArrayField
from django.db import models

class Article(models.Model):

    keywords = ArrayField(models.CharField(max_length=50), default=list)

You can get the overlapping keywords and length using a RawSQL expression :

from django.db.models.expressions import RawSQL
from django.contrib.postgres.fields import ArrayField
from django.db.models import CharField, Func, F, IntegerField

article = Article.objects.first()

similar_articles = (
    Article.objects.annotate(
        overlap=RawSQL(
            sql="ARRAY(select UNNEST(%s) INTERSECT select UNNEST(keywords))",
            params=(article.keywords,),
            output_field=ArrayField(CharField(max_length=50)),
        )
    ).annotate(
        overlap_count=Func(F("overlap"), function="CARDINALITY", output_field=IntegerField())
    ).filter(overlap_count__gte=2) # min_overlap of 2
)

My answer is heavily inspired on this SO answer which explains how to do it on Postgresql.

Charlesthk
  • 9,394
  • 5
  • 43
  • 45