
I have a Django model that contains records, each with a unique date. I'm currently counting records into ranges of days, e.g. X have already passed today's date, X will happen within the next 10 days, X will happen within the next 30 days. The code below is what I am currently using: it pulls all values back with a records.objects.all() query against the model, then loops through each object to calculate the datetime delta and increment the relevant counter.

from datetime import date

passed = tenDays = thirtyDays = missingValue = 0
today = date.today()  # hoisted out of the loop

for x in records:
    if x.date is None:
        missingValue += 1
    else:
        delta = x.date - today
        if delta.days < 0:       # already passed
            passed += 1
        if delta.days < 10:      # within the next 10 days (cumulative)
            tenDays += 1
        if delta.days < 30:      # within the next 30 days (cumulative)
            thirtyDays += 1

For around 50,000 records this takes about 5-6 seconds, which is longer than I would like, and I'm trying to reduce it as the number of records is likely to increase. The question is really around the performant calculation of datetime diffs and grouping of the resulting number of days; if there is a better method, through a Django query or otherwise, that I've not been able to find, I'm open to trying it.

I've explored the use of DateAdd in raw SQL, but it would appear to require a separate query against the database for each date range, and would still leave me looping through the results.
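For reference, one approach I've started looking at is conditional aggregation with Case/When (Django 1.8+), which should push all of the counting into a single query. A rough sketch of what I mean, mirroring the cumulative buckets in the loop above (I haven't verified the performance):

from datetime import date, timedelta
from django.db.models import Case, IntegerField, Sum, When

today = date.today()
counts = records.objects.aggregate(
    missing=Sum(Case(When(date__isnull=True, then=1),
                     default=0, output_field=IntegerField())),
    passed=Sum(Case(When(date__lt=today, then=1),
                    default=0, output_field=IntegerField())),
    ten_days=Sum(Case(When(date__lt=today + timedelta(days=10), then=1),
                      default=0, output_field=IntegerField())),
    thirty_days=Sum(Case(When(date__lt=today + timedelta(days=30), then=1),
                         default=0, output_field=IntegerField())),
)
# counts is a dict, e.g. {'missing': ..., 'passed': ..., 'ten_days': ..., 'thirty_days': ...}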

Draineh
  • Please prepare a http://www.rextester.com demo with sample data and desired output. I am sure that in pure SQL we could get times below 1 sec. – Lukasz Szozda Oct 21 '17 at 14:30
  • http://rextester.com/AWPX46055 - I think I've done that right. The desired output is just the return of numbers by the groups above: passed, within ten days, within thirty days. I also see a future need for anything over 30 days. The within dates are cumulative. – Draineh Oct 21 '17 at 15:02
  • Thanks for the sample data – Lukasz Szozda Oct 21 '17 at 15:30

2 Answers


Before optimizing performance I would consider batch-wise execution. Your smallest window of change appears to be 1 day, so by filtering on an 'updated' field on the record model, you can run the following every hour (per cron):

from datetime import datetime, timedelta

# Only records not processed within the last day, capped per run
records.objects.filter(updated__lt=datetime.now() - timedelta(days=1))[:2083]

and do your operation. Note that you can cap the number of records retrieved, so every hour 2083 (or 5000) records get processed, dividing the task over the day. You can scale this number to the number of records in the database (e.g. 50000 / 24 ≈ 2083). A sketch of such an hourly task follows below.

Your migration might also want to set the 'updated' field far in the past, so that every live record gets processed once in the beginning.
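A rough sketch of how the hourly job could look; the process() step is a placeholder for your own counting operation, and the batch size is the estimate above:

from datetime import datetime, timedelta

BATCH_SIZE = 2083  # ~50000 records / 24 hourly runs

def hourly_batch():
    cutoff = datetime.now() - timedelta(days=1)
    # Records not touched in the last day, capped per run
    for record in records.objects.filter(updated__lt=cutoff)[:BATCH_SIZE]:
        process(record)  # placeholder: your bucket-counting operation
        record.updated = datetime.now()
        record.save(update_fields=['updated'])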

rollinger
  • You're right that the smallest window is 1 day. At the moment I'm taking the results and presenting them directly to users (part of the issue). With this approach I guess I could cache the result and serve that each time a user views the figures, rather than calculating them on the fly – Draineh Oct 21 '17 at 15:05
  • Yes, the point here is: wherever you find yourself writing `Model.objects.all()` you should think about filtering, to reduce the stress on the database. If you can find other ways of filtering, that's good, e.g. when a user does an action, update all records associated with that user. The batch-every-hour approach is the exhaustive solution and avoids forgotten records. – rollinger Oct 21 '17 at 15:14

Using SQL windowed COUNT:

WITH cte AS (
     -- Bucket each record by days until targetdate (0 = passed; 10, 30 = cumulative windows; 31 = beyond)
     SELECT *, CASE WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 0  THEN 0
                    WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 10 THEN 10
                    WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 30 THEN 30
                    ELSE 31 END AS grp
     FROM [record]
     --WHERE targetdate > GETDATE() - 60  -- last 60 days
)
-- Running count over the ordered buckets yields cumulative totals per group
SELECT DISTINCT grp, COUNT(*) OVER(ORDER BY grp) AS running_count
FROM cte;

Rextester Demo
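If needed, the same query can be run from Django through a raw database cursor; a minimal sketch, assuming a SQL Server backend and the table/column names above:

from django.db import connection

QUERY = """
WITH cte AS (
     SELECT CASE WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 0  THEN 0
                 WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 10 THEN 10
                 WHEN DATEDIFF(DAY, GETDATE(), targetdate) <= 30 THEN 30
                 ELSE 31 END AS grp
     FROM [record]
)
SELECT DISTINCT grp, COUNT(*) OVER(ORDER BY grp) AS running_count
FROM cte
"""

with connection.cursor() as cursor:
    cursor.execute(QUERY)
    counts = dict(cursor.fetchall())  # e.g. {0: ..., 10: ..., 30: ..., 31: ...}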

Lukasz Szozda