I am using Django 1.9. I have a Django model that represents the value of a particular measure, by organisation by month, with raw values and percentiles:
    class MeasureValue(models.Model):
        org = models.ForeignKey(Org, null=True, blank=True)
        month = models.DateField()
        calc_value = models.FloatField(null=True, blank=True)
        percentile = models.FloatField(null=True, blank=True)
There are typically 10,000 or so records per month. My question is about whether I can speed up the process of setting values on these models.
Currently, I calculate percentiles by retrieving all the MeasureValues for a month using a Django filter query, converting them to a pandas DataFrame, and then using scipy's rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I'm happy with this method:
    import pandas as pd

    records = MeasureValue.objects.filter(month=month).values()
    df = pd.DataFrame.from_records(records)
    # use calc_value to set percentile on each row, using scipy's rankdata
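For context, the elided ranking step is conceptually something like the sketch below (illustrative only: rankdata's default "average" method handles ties the way I want, and the exact percentile scaling doesn't matter for this question):

    from scipy.stats import rankdata

    # rank only the non-null calc_values, then scale ranks to percentiles;
    # rows with a null calc_value keep a null percentile
    notnull = df.calc_value.notnull()
    ranks = rankdata(df.loc[notnull, "calc_value"])
    df.loc[notnull, "percentile"] = ranks / len(ranks) * 100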
However, I then need to retrieve each percentile value from the dataframe and set it back onto the model instances. Right now I do this by iterating over the dataframe's rows and updating each instance:
    for i, row in df.iterrows():
        # .values() returns the foreign key column as org_id, not org
        mv = MeasureValue.objects.get(org_id=row.org_id, month=month)
        # a FloatField can store None, but not NaN
        mv.percentile = None if pd.isnull(row.percentile) else row.percentile
        mv.save()
This is unsurprisingly quite slow. Is there an efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but I can't see one.
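For reference, one partial improvement I'm aware of is sketched below (assuming the dataframe's id column holds the primary keys, which it does when built from .values()). It avoids the per-row SELECT and wraps everything in one transaction, but it still issues one UPDATE per row rather than the single write I'm after:

    from django.db import transaction

    with transaction.atomic():  # one commit instead of one per row
        for i, row in df.iterrows():
            percentile = None if pd.isnull(row.percentile) else row.percentile
            # filter().update() writes without fetching the instance first
            MeasureValue.objects.filter(pk=row["id"]).update(percentile=percentile)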