1

What I am trying to do:

I have models Topic and Entry. Entry has a ForeignKey to topic. I need to list topics on condition that the user has entries in it (created in last 24 hours). I also need to annotate count, it needs to be the total number of entries created after the last entry written by the user. (To put it more thoroughly, you can think of an inbox where you have list of conversations with number of unread messages.)

This's what I have come up with:

relevant_topics = (
    Entry.objects.filter(author=user, date_created__gte=time_threshold(hours=24))
    .values_list("topic__pk", flat=True)
    .order_by()
    .distinct()
)

qs = (
    Topic.objects.filter(pk__in=relevant_topics).annotate(
        latest=Max("entries__date_created", filter=Q(entries__author=user)),
        count=Count("entries", filter=Q(date_created__gte=F("latest__date_created"))),
    )
).values("title", "count")

Which will throw:

FieldError: Cannot resolve keyword 'date_created' into field. Join on 'latest' not permitted.

I don't really know if Django itself doesn't support what I've written, or my solution is faulty. I thought adding count using .extra(), but I couldn't figure out how to use latest annotation there. I would really appreciate any query that produces the expected output.

Reference data set:

(assume the current user = Jack)

<User username: Jack>
<User username: John>

<Topic title: foo>
<Topic title: bar>
<Topic title: baz>

(Assume higher pk = created later.)

<Entry pk:1 topic:foo user:Jack>
<Entry pk:2 topic:foo user:Jack> (date_created in last 24 hours)
<Entry pk:3 topic:foo user:John> (date_created in last 24 hours)

<Entry pk:4 topic:bar user:Jack> (date_created in last 24 hours)

<Entry pk:5 topic:baz user:John> (date_created in last 24 hours)

Given the dataset, the output should only be:

<Topic:foo count:1>

EDIT:

To give you an idea, here is a raw SQL solution which produces correct output:

    pk = user.pk
    threshold = time_threshold(hours=24)

    with connection.cursor() as cursor:
        cursor.execute(
            """
        select
          s.title,
          s.slug,
          s.count
        from
          (
            select
              tt.title,
              tt.slug,
              e.count,
              e.max_id
            from
              (
                select
                  z.topic_id,
                  count(
                    case when z.id > k.max_id then z.id end
                  ) as count,
                  k.max_id
                from
                  dictionary_entry z
                  inner join (
                    select
                      topic_id,
                      max(de.id) as max_id
                    from
                      dictionary_entry de
                    where
                      de.date_created >= %s
                      and de.author_id = %s
                    group by
                      author_id,
                      topic_id
                  ) k on k.topic_id = z.topic_id
                group by
                  z.topic_id,
                  k.max_id
              ) e
              inner join dictionary_topic tt on tt.id = e.topic_id
          ) s
        where
          s.count > 0
        order by
          s.max_id desc
        """,
            [threshold, pk],
        )
        # convert to dict
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
suayip uzulmez
  • 628
  • 8
  • 23
  • Wouldn't it just be `F("latest")`? You may have to chain the annotations if you want to use an annotated field in another annotation – Iain Shelvington Apr 03 '20 at 19:05
  • No, I think `F("latest")` just refers to Entry itself. (Using it like that throws `OperationalError: misuse of aggregate function MAX()`, chaining annotations also doesn't help.) – suayip uzulmez Apr 03 '20 at 19:15
  • 1
    Could you provide Topic and Entry models? – Anton Pomieshchenko Apr 06 '20 at 04:58
  • Yes. Full models: https://github.com/realsuayip/django-sozluk/blob/master/dictionary/models/entry.py#L19 and https://github.com/realsuayip/django-sozluk/blob/master/dictionary/models/topic.py#L25 – suayip uzulmez Apr 06 '20 at 13:36
  • Can you please also provide an expected outcome? – KrysotL Apr 09 '20 at 12:15
  • @KrysotL what do you exactly mean by expected outcome? I think I properly explained the scenario. Do you need any specific details? – suayip uzulmez Apr 09 '20 at 14:44
  • I had mind some input and output data. Just a snippet, which would help me write a test. I presume that the answer of @Paul Rene provides correct outputs, so I test against that. – KrysotL Apr 09 '20 at 19:52

2 Answers2

2

This can be achieved in 1 SQL query in database by

  1. filtering the relevant entries (the important bit is the OuterRef which "transfers" the filter to the topics),
  2. grouping the entries by topic and using count, and then
  3. annotating the topics using a Subquery.

One can find some info about that in Django docs.

For your case the following should produce the desired result.

from django.db.models import Count, IntegerField, OuterRef, Subquery

relevant_topics = (
    models.Entry.objects.filter(
        author=user, date_created__gte=time_threshold(24), topic=OuterRef("pk"),
    )
    .order_by()
    .values("topic")
    .annotate(Count("id"))
    .values("id__count")
)

qs = models.Topic.objects.annotate(
    entries_count=Subquery(relevant_topics, output_field=IntegerField())
).filter(entries_count__gt=0)

Hope this helps :-)

Edit 1:

I think I misunderstood the question and forgot to take into account the fact that it is the entries of the other authors that need to be counted (after the last one of the current author).

So, I came up with the following, which gives the same results as the answer of @Paul Rene:

latest_in_topic = (
    Entry.objects.filter(author=user, date_created__gte=time_threshold(24), topic=OuterRef("topic"))
    .values("topic")
    .annotate(latest=Max("date_created"))
)

qs = (
    Entry.objects.annotate(
        latest=Subquery(latest_in_topic.values("latest"), output_field=DateTimeField())
    )
    .filter(date_created__gte=F("latest"))
    .values("topic", "topic__title")
    .annotate(Count("id"))
)

res = [(t["topic__title"], t["id__count"]) for t in qs]

Edit 2: The ORM produces the following query (obtained by str(qs.query)). Maybe, there will be some clue how to improve the performance.

SELECT "entry"."topic_id", "topic"."title", COUNT("entry"."id") AS "id__count"
FROM "entry"
         INNER JOIN "topic" ON ("entry"."topic_id" = "topic"."id")
WHERE "entry"."date_created" > (SELECT MAX(U0."date_created") AS "latest"
                                    FROM "entry" U0
                                    WHERE (U0."author_id" = 1 AND U0."date_created" >= '2020-04-09 16:31:48.407501+00:00' AND U0."topic_id" = ("entry"."topic_id"))
                                    GROUP BY U0."topic_id")
GROUP BY "entry"."topic_id", "topic"."title";
KrysotL
  • 94
  • 9
  • I tried to get it work, bu no success. There are various problems: qs annotates all topics which is a problem, so I added date and user filter in topic. This produced duplicate results which could be avoided by distinct() which is undesired as I need to order the topics. If you inspect, count is totally wrong because it gives the count of all entries written in that topic by user. So, your code can be rephrased as: ```Topic.objects.annotate( count=Count("entries", filter=Q(date_created__gte=time_threshold(24)) & Q(entries__author=user)) ).filter(count__gt=0) ``` – suayip uzulmez Apr 09 '20 at 14:41
  • I probably misunderstood the question. Sorry for that. Tried to improve the answer, please see the edit 1. – KrysotL Apr 09 '20 at 19:49
  • Thanks for your attention. But this also hangs the database. In 'qs', are we certain that only the selected entries (i.e. entries of latest_in_topic) get filtered out? (Earlier, I tried very close implementation, but the problem was I couldn't select the right topics.) I also think instead of annotating entry, we should use Topic instead (using related lookup 'entries'). Also, I edited my question to include a little data snippet, I hope it helps. – suayip uzulmez Apr 09 '20 at 21:17
  • In order to test that only the right `topics` are selected, try enumerating the `latest_in_topic` queryset (just remove the `, topic=OuterRef("topic")` part). The result should be a list of topics, where the current user authored an entry in the last 24 hours, annotated with the `date_created` of the latest `entry` for each `topic`. – KrysotL Apr 10 '20 at 16:52
  • Regarding the what should be annotated: the main information (the count of the `entries`) can be obtained only from the `entries` table, thus, the less we touch the `topics` table the simpler (faster) query we get. – KrysotL Apr 10 '20 at 16:55
  • Maybe an index on `entries.date_created` column would improve the performance. If this does not help, I'd try to narrow the query by selecting fewer topics. You can also try running `qs.explain()` on real data to see, where the bottlenecks of the query are (seq scans and nested loops are the most costly operations). – KrysotL Apr 10 '20 at 17:01
  • Also there the option of running a raw query. Or having a side process that computes the result and saves it either to db or to some cache. But both methods are kind of last resort. – KrysotL Apr 10 '20 at 17:09
  • I mean, if you reference Subquery do you get to select related topics from 'latest_in_topic '? Otherwise there is no filtering whatsoever to select right topics. Also I want to touch topics table because I need to access its fields. In both cases each table will get the same hit, so semantically I think annotating topics is better. Earlier, selecting fewer topics actually worked for me (i.e. selecting topics that were created in last 24 hours), but it didn't solve my problem really and it was awfully slow. – suayip uzulmez Apr 10 '20 at 17:09
1

I rebuilt your query, I hope I understood your goal correctly. I arrived at the same error. It seems to have something to do with the way SQL evaluates queries. I rephrase your queries as follow:

    qs0 = Topic.objects.filter(
        entries__author=user, entries__date_created__gte=time_threshold(24)).annotate(
            latest=Max("entries__date_created")
        )
    qs1 = qs0.annotate(
        count=Count("entries", filter=Q(entries__date_created__gte=F("latest")))
        ).values("title", "count")

So I first filter out the recent topics where 'user' had entries and annotate them with the date of latest entry (qs0), and then try to annotate that query with the desired count. The first query does what it is supposed to do; when I print it or evaluate it in a list, the results seem correct to me (I used mock data). But with the second query I get the following error message:

aggregate functions are not allowed in FILTER
LINE 1: ...") FILTER (WHERE "dummy_entry"."date_created" >= (MAX("dummy...

Digging on the internet told me that it might have to do with the way SQL handles WHERE. I tried both MySQL and PostgreSQL, both produced errors. To my view the second query is syntactically correct, but because the first query is not evaluated before it is fed into the second, that's how the error occurs.

Anyway, I was able to get the desired result (again, if I understand you correctly), albeit in a very ugly way, by using the following code in place of the second query:

    dict = {}
    for item in qs0:
        dict[item.pk] = [item.title, item.latest, 0]

    for entry in Entry.objects.all():
        if entry.date_created >= dict[entry.topic.pk][1]:
            dict[entry.topic.pk][2] += 1

I put qs0 in a dict with pk as key, and did the count for all the entries manually.

I'm afraid this is the best I can do. I truly hope that someone comes up with a more elegant solution!

EDIT after reading Krysotl's answer:

Not a final answer, but maybe it helps. Most of the times WHERE cannot be used before aggregate functions, see Aggregate function in SQL WHERE-Clause. Sometimes this can be repaired by replacing a WHERE by a HAVING in SQL. Django is able to process raw SQL queries, see https://docs.djangoproject.com/en/3.0/ref/models/expressions/#raw-sql-expressions. So I tried the following:

sql_command = '''SELECT entry.topic_id, topic.title, entry.date_created, COUNT(entry.id) AS id__count FROM entry
        INNER JOIN topic ON (entry.topic_id = topic.id) GROUP BY entry.topic_id, topic.title, entry.date_created
        HAVING entry.date_created > (SELECT MAX(U0.date_created) AS latest
        FROM entry U0 WHERE (U0.author_id = 1 AND U0.date_created >= '2020-04-09 16:31:48.407501+00:00'
        AND U0.topic_id = (entry.topic_id)) GROUP BY U0.topic_id)'''

    qs = Entry.objects.annotate(val=RawSQL(sql_command, ()))

In other words: put the GROUP BY before the WHERE and replace the WHERE with a HAVING. Unfortunately it still gave me errors. I'm afraid I'm not enough of an SQL expert to resolve this, but maybe it is a way forward.

Paul Rene
  • 680
  • 4
  • 14
  • Thank you for your attention. Your solution is way too inefficient to consider, as it iterates over all entries which I think there could be millions of them. I think we can just iterate over topics to get latest__date_created and create another queryset to just filter out entries and get count of them. But currently I'm looking for just 'one' efficient queryset solution. – suayip uzulmez Apr 07 '20 at 13:22