36

I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation by month, with raw values and percentiles:

class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)

There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.

Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:

records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
# use calc_value to set percentile on each row, using scipy's rankdata
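For illustration only, the elided ranking step might look roughly like the sketch below. It uses pandas' own `rank(pct=True)` (which skips nulls and averages ties) in place of the `rankdata` call the question refers to, and the frame contents and `org_id` column name are assumptions:

```python
import numpy as np
import pandas as pd

# hypothetical frame standing in for the MeasureValue records
df = pd.DataFrame({
    'org_id': [1, 2, 3, 4],
    'calc_value': [0.2, 0.8, np.nan, 0.5],
})

# rank(pct=True) skips NaN values and averages ties; scale to 0-100
df['percentile'] = df['calc_value'].rank(pct=True) * 100
```

Rows with a null `calc_value` keep a null percentile, which matches the null-handling the question asks for.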

However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows, and updating each instance:

for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()

This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can't see one.

Richard
  • 62,943
  • 126
  • 334
  • 542

4 Answers

53

Atomic transactions can reduce the time spent in the loop:

from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)

        if (row.percentile is None) or np.isnan(row.percentile):
            # normalise NaN to None so it is stored as NULL
            row.percentile = None

        mv.percentile = row.percentile
        mv.save()

Django’s default behavior is to run in autocommit mode: each query is immediately committed to the database, unless a transaction is active.

By using `with transaction.atomic()` all the writes are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed statements, so the time per statement is greatly reduced.
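The same effect can be seen outside Django with a quick sqlite3 sketch (the table name and row count are made up): all the writes inside one transaction pay the commit cost once, instead of once per statement:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE measure (percentile REAL)')

# the `with` block is one transaction: a single COMMIT covers all 1000 writes
with conn:
    for i in range(1000):
        conn.execute('INSERT INTO measure (percentile) VALUES (?)', (i / 10.0,))

count = conn.execute('SELECT COUNT(*) FROM measure').fetchone()[0]
```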

ahmed
  • 5,430
  • 1
  • 20
  • 36
  • Thanks. Why does this reduce the time? The documentation isn't helping https://docs.djangoproject.com/en/1.9/topics/db/transactions/#controlling-transactions-explicitly – Richard Apr 21 '16 at 15:11
  • 2
    Because the queries are combined into a single large query, that is only executed once. Let's say saving to the database takes 10 ms: if you save 100 objects running autocommit, that takes 10 ms * 100, but if you use a transaction it will only take 10 ms * 1 (roughly) – Peter F Jun 17 '20 at 09:57
  • Very helpful answer. Actually I was looking for something else but got a very good alternative – kiran kumar Aug 31 '20 at 13:41
35

As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:

objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])

In older versions of Django you could use update() with Case/When, e.g.:

from django.db.models import Case, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=headline)
                    for entry_pk, headline in headlines.items()]))
Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
  • 1
    Thanks for putting the older version in there. I don't have access to `bulk_update` so getting clarity on the `update` call was reassuring. – NeilG Jun 09 '21 at 02:35
  • Is `Case`, `When` faster than `transactions` ? – Ahtisham Feb 04 '22 at 15:52
  • I found more detail on the behind the scenes of this here https://www.reddit.com/r/django/comments/mjpbij/running_a_bulk_update_efficiently_with_django/ – Shlomo Sep 04 '22 at 12:08
1

In my case, the value also needed to be wrapped in Value():

from django.db.models import Case, When, Value

# `headlines` is a pk -> headline mapping, e.g. {1: 'some_val1', 2: 'some_val2', ...}
Entry.objects.filter(
    pk__in=headlines
).update(
    headline=Case(*[When(pk=entry_pk, then=Value(headline))
                    for entry_pk, headline in headlines.items()]))
cconsta1
  • 737
  • 1
  • 6
  • 20
fromSelect
  • 23
  • 4
0

Actually, attempting @Eugene Yarmash's answer, I found I got this error:

FieldError: Joined field references are not permitted in this query

But I believe an iterated update() is still quicker than multiple save() calls, and wrapping the loop in a transaction should expedite things further.

So, for versions of Django that don't offer bulk_update, assuming the same data used in Eugene's answer, where headlines is a pk -> headline mapping:

from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)
NeilG
  • 3,886
  • 2
  • 22
  • 30