I'm calculating summary statistics for date-based subsets of a large "competition" table (~3M rows) stored in SQLite. Specifically, I'm calculating statistics for:
- this year
- last year
- the lifetime of the competitor
Here's a model breakdown:
```python
from django.db import models
from django.utils.translation import gettext_lazy as _


class Competitor(models.Model):
    pass  # ID is the only important part here


class Venue(models.Model):
    pass  # ID is the only important part here


class Division(models.Model):
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)
    # ...


class Level(models.Model):
    division = models.ForeignKey(Division, on_delete=models.CASCADE)
    # ...


class Class(models.TextChoices):
    STANDARD = "Standard", _("Standard")
    JWW = "JWW", _("JWW")


class Run(models.Model):
    competitor = models.ForeignKey(
        Competitor, related_name="runs", on_delete=models.CASCADE, db_index=True
    )
    date = models.DateField(verbose_name="Date created", db_index=True)
    MACH = models.IntegerField(..., db_index=True)
    PACH = models.IntegerField(..., db_index=True)
    yps = models.FloatField(..., db_index=True)
    score = models.IntegerField(..., db_index=True)
    qualified = models.BooleanField(..., db_index=True)
    division = models.ForeignKey(Division, on_delete=models.CASCADE, db_index=True)
    level = models.ForeignKey(Level, on_delete=models.CASCADE, db_index=True)
    cls = models.CharField(max_length=..., choices=Class.choices)
    # ... Other fields that aren't relevant
```
For each `Competitor`, I want to generate summary statistics that describe their performance this year, last year, and over all time, and store them in a `Report` model:
```python
class CurrPrevLifetime(models.Model):
    curr_yr = models.FloatField(default=0)
    prev_yr = models.FloatField(default=0)
    lifetime = models.FloatField(default=0)


class Report(models.Model):
    ... = models.ForeignKey(CurrPrevLifetime, related_name=...)
    # repeat for as many fields as need this
```
My current aggregation setup looks like this:
```python
import datetime

from django.db.models import Avg, Max, Q, Sum

curr_yr = Q(date__year=datetime.date.today().year)
prev_yr = Q(date__year=datetime.date.today().year - 1)
JWW = Q(cls=Class.JWW)
standard = Q(cls=Class.STANDARD)

aggregates = {
    "curr_yr_max_standard_MACH": Max("MACH", filter=curr_yr & standard),
    "curr_yr_max_standard_PACH": Max("PACH", filter=curr_yr & standard),
    "curr_yr_average_yps_standard": Avg("yps", filter=curr_yr & standard),
    "curr_yr_max_yps_standard": Max("yps", filter=curr_yr & standard),
    "curr_yr_max_JWW_MACH": Max("MACH", filter=curr_yr & JWW),
    "curr_yr_max_JWW_PACH": Max("PACH", filter=curr_yr & JWW),
    "curr_yr_average_yps_JWW": Avg("yps", filter=curr_yr & JWW),
    "curr_yr_max_yps_JWW": Max("yps", filter=curr_yr & JWW),
    "curr_yr_MACH_points": Sum("MACH", filter=curr_yr),
    "curr_yr_PACH_points": Sum("PACH", filter=curr_yr),
    "prev_yr_max_standard_MACH": Max("MACH", filter=prev_yr & standard),
    "prev_yr_max_standard_PACH": Max("PACH", filter=prev_yr & standard),
    "prev_yr_average_yps_standard": Avg("yps", filter=prev_yr & standard),
    "prev_yr_max_yps_standard": Max("yps", filter=prev_yr & standard),
    "prev_yr_max_JWW_MACH": Max("MACH", filter=prev_yr & JWW),
    "prev_yr_max_JWW_PACH": Max("PACH", filter=prev_yr & JWW),
    "prev_yr_average_yps_JWW": Avg("yps", filter=prev_yr & JWW),
    "prev_yr_max_yps_JWW": Max("yps", filter=prev_yr & JWW),
    "prev_yr_MACH_points": Sum("MACH", filter=prev_yr),
    "prev_yr_PACH_points": Sum("PACH", filter=prev_yr),
    "lifetime_max_standard_MACH": Max("MACH", filter=standard),
    "lifetime_max_standard_PACH": Max("PACH", filter=standard),
    "lifetime_average_yps_standard": Avg("yps", filter=standard),
    "lifetime_max_yps_standard": Max("yps", filter=standard),
    "lifetime_max_JWW_MACH": Max("MACH", filter=JWW),
    "lifetime_max_JWW_PACH": Max("PACH", filter=JWW),
    "lifetime_average_yps_JWW": Avg("yps", filter=JWW),
    "lifetime_max_yps_JWW": Max("yps", filter=JWW),
    "lifetime_MACH_points": Sum("MACH"),
    "lifetime_PACH_points": Sum("PACH"),
}

competitor.runs.aggregate(**aggregates)
```
I then take the results, break them into triples of `(curr, prev, lifetime)`, and store those as `CurrPrevLifetime` rows using `CurrPrevLifetime.objects.bulk_create`.
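Roughly, that post-processing step looks like this (a simplified sketch: the suffix list is abbreviated, and wiring the rows into `Report` is omitted):

```python
results = competitor.runs.aggregate(**aggregates)

# Every metric appears three times in the results, under curr_yr_ /
# prev_yr_ / lifetime_ prefixes, so regroup by the shared suffix.
suffixes = (
    "max_standard_MACH",
    "max_standard_PACH",
    "average_yps_standard",
    # ... and so on for the rest of the keys
)
rows = [
    CurrPrevLifetime(
        curr_yr=results[f"curr_yr_{suffix}"] or 0,
        prev_yr=results[f"prev_yr_{suffix}"] or 0,
        lifetime=results[f"lifetime_{suffix}"] or 0,
    )
    for suffix in suffixes
]
CurrPrevLifetime.objects.bulk_create(rows)
```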
This aggregation takes surprisingly long to complete (~5 s), given that the operations are just maxes, averages, and sums. For reference, a competitor has somewhere in the vicinity of ~500 runs to their name over their lifetime.
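(The ~5 s figure is the `aggregate()` call measured in isolation, roughly like this:)

```python
import time

start = time.perf_counter()
competitor.runs.aggregate(**aggregates)
print(time.perf_counter() - start)  # ~5 seconds
```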
What is to blame for the performance hit? Indexes are applied to the fields being aggregated and to the relations between the models, and I've tried going the "filter then aggregate" route with no noticeable difference and a significant increase in lines of code.
With 40,000 competitors, 5 seconds per report is going to take too long. How do I speed up this process? I'm happy to restructure if necessary.
I've tried tackling this a number of ways:
- Breaking `competitor.runs` into `curr_yr_runs`, `prev_yr_runs`, and `lifetime_runs` and performing the aggregates individually (poor performance; sketched below).
- Breaking them up by `cls` into `JWW_runs` and `standard_runs` and performing the aggregates by date that way (poor performance).
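For reference, the first of those attempts looked roughly like this (a sketch; the per-subset aggregate dicts are abbreviated):

```python
this_year = datetime.date.today().year

curr_yr_runs = competitor.runs.filter(date__year=this_year)
prev_yr_runs = competitor.runs.filter(date__year=this_year - 1)
lifetime_runs = competitor.runs.all()

# The same Max/Avg/Sum aggregates as above, minus the date filters,
# run once per subset: three queries instead of one.
curr = curr_yr_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    max_yps_standard=Max("yps", filter=standard),
    # ... and so on
)
prev = prev_yr_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    # ... same aggregates as curr
)
lifetime = lifetime_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    # ... same aggregates as curr
)
```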