2

I try to code the equivalent Django query from this SQL query, but I'm stuck. Any help is welcome. I receive a race id and from this race I want to do some statistics : nb_race = the number of races from a Horse before the race given, best_chrono = best time from a Horse before the race given.

SELECT *, (SELECT count(run.id)
                FROM runner run
                INNER JOIN race
                ON run.race_id = race.id
                WHERE run.horse_id = r.horse_id
                AND race.datetime_start < rc.datetime_start 
                ) AS nb_race, 
          (SELECT min(run.chrono)
                FROM runner run
                INNER JOIN race
                ON run.race_id = race.id
                WHERE run.horse_id = r.horse_id
                AND race.datetime_start < rc.datetime_start 
                ) AS best_time
FROM runner r, race rc
WHERE r.race_id = rc.id
AND rc.id = 7890

Django Models:

class Horse(models.Model):
    id = AutoField(primary_key=True)
    name = models.CharField(max_length=255, blank=True, null=True, default=None)

class Race(models.Model):
    id = AutoField(primary_key=True)
    datetime_start = models.DateTimeField(blank=True, null=True, default=None)
    name = models.CharField(max_length=255, blank=True, null=True, default=None)

class Runner(models.Model):
    id = AutoField(primary_key=True)
    horse = models.ForeignKey(Horse, on_delete=models.PROTECT)
    race = models.ForeignKey(Race, on_delete=models.PROTECT)
    chrono = models.DecimalField(max_digits=10, decimal_places=2, blank=True, null=True, default=None)
hynekcer
  • 14,942
  • 6
  • 61
  • 99
Frederic
  • 115
  • 10
  • You may want to consider running this query via Django's ```raw()``` functionality due to the complexity of the query. Information on the topic can be found here on the Django website.https://docs.djangoproject.com/en/3.0/topics/db/sql/ – Lewis Jun 26 '20 at 19:10
  • I deed, but I don't understand how the " Correlated subquery" works in Django. My last attempt is this : ```runners = Runner.objects.filter(race_id=race.id) \ .annotate( nb_course=RawSQL( "SELECT count(runner.id) FROM runner INNER JOIN race ON race.course_id = race.id WHERE runner.horse_id = %s", (OuterRef('horse_id'))))``` I just can't access to the "horse_id " in the "Runner.objects.filter(race_id=race.id)" – Frederic Jun 26 '20 at 19:57
  • Frederic, what sort of table size are you working with? As if your working with a fairly small dataset where efficiency is not so much of a concern, it might be worth extracting your primary data to python and use pandas to apply the aggregation and querying? Just a suggestion. – Lewis Jun 26 '20 at 20:09
  • My db is consequent: 12 000 000 runners, 990 000 horses. And that's just 2 of more than 20 tables... around 10 GiB for the total. I don't know pandas, I will have a look. – Frederic Jun 26 '20 at 20:14
  • Yeah stick with SQL. I’ll have a go at it tomorrow. – Lewis Jun 26 '20 at 20:17
  • Thank you Lewis. – Frederic Jun 26 '20 at 20:28

1 Answers1

2

Subquery expression can be used to compile an additional queryset as subquery that depends on the main queryset and execute them together as one SQL.

from django.db.models import OuterRef, Subquery, Count, Min, F

# prepare a repeated expression about previous runners, but don't execute it yet
prev_run = (
    Runner.objects
    .filter(
        horse=OuterRef('horse'),
        race__datetime_start__lt=OuterRef('race__datetime_start'))
    .values('horse')
)
queryset = (
    Runner.objects
    .values('id', 'horse_id', 'race_id', 'chrono', 'race__name', 'race__datetime_start')
    .annotate(
        nb_race=Subquery(prev_run.annotate(nb_race=Count('id')).values('nb_race')),
        best_time=Subquery(prev_run.annotate(best_time=Min('chrono')).values('best_time'))
    )
)

Some tricks used here are described in the linked docs:

  • Output fields of the Subquery must be restricted by .values(...) to one field: only the aggregated value
  • A subquery must be a queryset (that is evaluated lazy and combined together), not a value (that would be evaluated immediately and fail). Therefore .annotate() is used in the subquery (not .aggregate()). That adds a GROUP BY race.horse_id, but is is not a problem becase there is also WHERE race.horse_id = ... and the "group by" will be finally ignored by an SQL optimizer in a modern database backend.

It is compiled to a query equivalent to the SQL in the example. Check the SQL:

>>> print(str(queryset.query))
SELECT ...,
  (SELECT COUNT(U0.id)
   FROM runner U0 INNER JOIN race U1 ON (U0.race_id = U1.id)
   WHERE (U0.horse_id = runner.horse_id AND U1.datetime_start < race.datetime_start)
   GROUP BY U0.horse_id
   ) AS nb_race,
   ...
FROM runner INNER JOIN race ON (runner.race_id = race.id)

A marginal difference is that a subquery uses some internal aliases like U0 and U1.

hynekcer
  • 14,942
  • 6
  • 61
  • 99