I'm calculating summary statistics for date-based subsets of a large "competition" table (~3M rows) stored in SQLite. Specifically, I'm calculating statistics for:
- this year
- last year
- the lifetime of the competitor
Here's a model breakdown:
```python
from django.db import models
from django.utils.translation import gettext_lazy as _


class Competitor(models.Model):
    pass  # ID is the only important part here


class Venue(models.Model):
    pass  # ID is the only important part here


class Division(models.Model):
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)
    # ...


class Level(models.Model):
    division = models.ForeignKey(Division, on_delete=models.CASCADE)
    # ...


class Class(models.TextChoices):
    STANDARD = "Standard", _("Standard")
    JWW = "JWW", _("JWW")


class Run(models.Model):
    competitor = models.ForeignKey(
        Competitor, related_name="runs", on_delete=models.CASCADE, db_index=True
    )
    date = models.DateField(verbose_name="Date created", db_index=True)
    MACH = models.IntegerField(..., db_index=True)
    PACH = models.IntegerField(..., db_index=True)
    yps = models.FloatField(..., db_index=True)
    score = models.IntegerField(..., db_index=True)
    qualified = models.BooleanField(..., db_index=True)
    division = models.ForeignKey(Division, on_delete=models.CASCADE, db_index=True)
    level = models.ForeignKey(Level, on_delete=models.CASCADE, db_index=True)
    cls = models.CharField(max_length=..., choices=Class.choices)
    # ... Other fields that aren't relevant
```
For each `Competitor`, I want to generate summary statistics that describe their performance this year, last year, and over all time, and store them in a `Report` model:
```python
class CurrPrevLifetime(models.Model):
    curr_yr = models.FloatField(default=0)
    prev_yr = models.FloatField(default=0)
    lifetime = models.FloatField(default=0)


class Report(models.Model):
    ... = models.ForeignKey(CurrPrevLifetime, related_name=...)
    # repeat for as many fields as need this
```
My current aggregation setup looks like this:
```python
import datetime

from django.db.models import Avg, Max, Q, Sum

curr_yr = Q(date__year=datetime.date.today().year)
prev_yr = Q(date__year=datetime.date.today().year - 1)
JWW = Q(cls=Class.JWW)
standard = Q(cls=Class.STANDARD)

aggregates = {
    "curr_yr_max_standard_MACH": Max("MACH", filter=curr_yr & standard),
    "curr_yr_max_standard_PACH": Max("PACH", filter=curr_yr & standard),
    "curr_yr_average_yps_standard": Avg("yps", filter=curr_yr & standard),
    "curr_yr_max_yps_standard": Max("yps", filter=curr_yr & standard),
    "curr_yr_max_JWW_MACH": Max("MACH", filter=curr_yr & JWW),
    "curr_yr_max_JWW_PACH": Max("PACH", filter=curr_yr & JWW),
    "curr_yr_average_yps_JWW": Avg("yps", filter=curr_yr & JWW),
    "curr_yr_max_yps_JWW": Max("yps", filter=curr_yr & JWW),
    "curr_yr_MACH_points": Sum("MACH", filter=curr_yr),
    "curr_yr_PACH_points": Sum("PACH", filter=curr_yr),
    "prev_yr_max_standard_MACH": Max("MACH", filter=prev_yr & standard),
    "prev_yr_max_standard_PACH": Max("PACH", filter=prev_yr & standard),
    "prev_yr_average_yps_standard": Avg("yps", filter=prev_yr & standard),
    "prev_yr_max_yps_standard": Max("yps", filter=prev_yr & standard),
    "prev_yr_max_JWW_MACH": Max("MACH", filter=prev_yr & JWW),
    "prev_yr_max_JWW_PACH": Max("PACH", filter=prev_yr & JWW),
    "prev_yr_average_yps_JWW": Avg("yps", filter=prev_yr & JWW),
    "prev_yr_max_yps_JWW": Max("yps", filter=prev_yr & JWW),
    "prev_yr_MACH_points": Sum("MACH", filter=prev_yr),
    "prev_yr_PACH_points": Sum("PACH", filter=prev_yr),
    "lifetime_max_standard_MACH": Max("MACH", filter=standard),
    "lifetime_max_standard_PACH": Max("PACH", filter=standard),
    "lifetime_average_yps_standard": Avg("yps", filter=standard),
    "lifetime_max_yps_standard": Max("yps", filter=standard),
    "lifetime_max_JWW_MACH": Max("MACH", filter=JWW),
    "lifetime_max_JWW_PACH": Max("PACH", filter=JWW),
    "lifetime_average_yps_JWW": Avg("yps", filter=JWW),
    "lifetime_max_yps_JWW": Max("yps", filter=JWW),
    "lifetime_MACH_points": Sum("MACH"),
    "lifetime_PACH_points": Sum("PACH"),
}

competitor.runs.aggregate(**aggregates)
```
I then take the results, break them into triples of `(curr, prev, lifetime)`, and store those as `CurrPrevLifetime` rows using `CurrPrevLifetime.objects.bulk_create`.
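Roughly, that post-processing step looks like this (a simplified sketch: the suffix list is abbreviated, and wiring the rows into `Report` is omitted):

```python
results = competitor.runs.aggregate(**aggregates)

# Every metric appears three times in the results, under curr_yr_ /
# prev_yr_ / lifetime_ prefixes, so regroup by the shared suffix.
suffixes = (
    "max_standard_MACH",
    "max_standard_PACH",
    "average_yps_standard",
    # ... and so on for the rest of the keys
)
rows = [
    CurrPrevLifetime(
        curr_yr=results[f"curr_yr_{suffix}"] or 0,
        prev_yr=results[f"prev_yr_{suffix}"] or 0,
        lifetime=results[f"lifetime_{suffix}"] or 0,
    )
    for suffix in suffixes
]
CurrPrevLifetime.objects.bulk_create(rows)
```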
This aggregation takes surprisingly long to complete (~5 s), given that the operations are just maxes, averages, and sums. For reference, a competitor has somewhere in the vicinity of ~500 runs to their name over their lifetime.
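(The ~5 s figure is the `aggregate()` call measured in isolation, roughly like this:)

```python
import time

start = time.perf_counter()
competitor.runs.aggregate(**aggregates)
print(time.perf_counter() - start)  # ~5 seconds
```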
What is to blame for the performance hit? Indexes are applied to the fields being aggregated and to the relations between the models, and I've tried going the "filter then aggregate" route with no noticeable difference and a significant increase in lines of code.
With 40,000 competitors, 5 seconds per report is going to take too long. How do I speed up this process? I'm happy to restructure if necessary.
I've tried tackling this a number of ways:
- Breaking `competitor.runs` into `curr_yr_runs`, `prev_yr_runs`, and `lifetime_runs` and performing the aggregates individually (poor performance; sketched below).
- Breaking them up by `cls` into `JWW_runs` and `standard_runs` and performing the aggregates by date that way (poor performance).
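For reference, the first of those attempts looked roughly like this (a sketch; the per-subset aggregate dicts are abbreviated):

```python
this_year = datetime.date.today().year

curr_yr_runs = competitor.runs.filter(date__year=this_year)
prev_yr_runs = competitor.runs.filter(date__year=this_year - 1)
lifetime_runs = competitor.runs.all()

# The same Max/Avg/Sum aggregates as above, minus the date filters,
# run once per subset: three queries instead of one.
curr = curr_yr_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    max_yps_standard=Max("yps", filter=standard),
    # ... and so on
)
prev = prev_yr_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    # ... same aggregates as curr
)
lifetime = lifetime_runs.aggregate(
    max_standard_MACH=Max("MACH", filter=standard),
    # ... same aggregates as curr
)
```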