
I have this table:

| ID | val1 | val2 |
|----|------|------|
| 1  | 5    | 6    |
| 2  | 6    | 4    |
| 3  | 3    | 1    |
| 4  | 8    | 4    |
| 5  | 2    | 6    |
| 6  | 8    | 2    |

Using filter query in Django, I want to summarize these data so that I get the mean values of val1 and val2 on each n records. For example if n=3 then:

| ID | val1 | val2 |
|----|------|------|
| 1  | 4.7  | 3.7  |
| 2  | 6.0  | 4.0  |

4 Answers

n = 3  # group size
ind = 0
v1ofn = 0
v2ofn = 0
result_list = []
for row in tname.objects.all():
    if ind >= n:
        result_list.append([v1ofn / float(n), v2ofn / float(n)])
        v1ofn = row.val1  # start a new group with the current row
        v2ofn = row.val2
        ind = 0
    else:
        v1ofn = v1ofn + row.val1
        v2ofn = v2ofn + row.val2
    ind = ind + 1

This assumes the table has a multiple of n items; if not, do some extra logic after the loop to handle the extras (note that even the last complete group is only appended when the next row is seen, so it also has to be flushed after the loop).
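A minimal sketch of that post-loop handling, reusing the result_list, v1ofn, v2ofn and ind variables from the loop above (dividing by ind rather than n for a trailing partial group is my assumption about the desired behaviour):

# flush the last accumulated group; ind holds the number of rows in it
if ind > 0:
    result_list.append([v1ofn / float(ind), v2ofn / float(ind)])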

Andrew Luo

You have to define your models first; then you can use Count, Min, Sum, Avg from django.db.models, which will let you do:

Table.objects.aggregate(average_val1=Avg('val1'))

The result dictionary will have a key called 'average_val1'. If no such alias were specified, it would be the rather long 'val1__avg'.

More documentation can be found here: Django documentation - Aggregation

NB: You can filter first, then use aggregation to execute the operation on a specific set. E.g.:

Table.objects.filter(pk__in=id_list).aggregate(average_val1=Avg('val1'))

or you can determine the boundary ids [id1, id2] of each group of n and then do this:

Table.objects.filter(pk__gte=id1, pk__lte=id2).aggregate(average_val1=Avg('val1'))

The __in, __lte and __gte lookups make sure you filter only the id set you want; then you can aggregate on that set.

__in: in a list, __lte: less than or equal, __gte: greater than or equal.
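Putting the pieces together, a minimal sketch of the per-n-records averaging with this approach (the chunking loop and the variable names n, ids and chunk are my assumptions, not part of the original answer; note it still issues one query per group, which is what the comments below point out):

from django.db.models import Avg

n = 3
ids = list(Table.objects.order_by('pk').values_list('pk', flat=True))
averages = []
for start in range(0, len(ids), n):
    chunk = ids[start:start + n]  # the pks of one group of n records
    averages.append(
        Table.objects.filter(pk__in=chunk).aggregate(
            average_val1=Avg('val1'),
            average_val2=Avg('val2'),
        )
    )
# averages is a list of dicts, one per group of n rows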

e-nouri
  • I want aggregation for each group of n records. I don't have a field for grouping. – mrbf Oct 22 '14 at 07:04
  • You can filter then aggregate ;) ! – e-nouri Oct 22 '14 at 07:10
  • Thank you. But suppose I have a table with 20000 records and n = 4. Then I have to run `Table.objects.filter` 5000 times, and each time id_list must change with new IDs! – mrbf Oct 22 '14 at 08:47
  • I see what you mean; I can't think of a way to do this in one line in Django. You can use the `Table.objects.all().extra()` API to run SQL directly, or you can just use the `Less/Greater than or equal` lookups to avoid listing the ids. – e-nouri Oct 22 '14 at 09:23

I guess Aggregation is the Django way to do this, but the suggested example would produce a huge number of queries, as Behesti said. And my guess is that the Django ORM is really not built for number crunching (but I might be wrong here!).

I would maybe go with numpy (if you really have a huge array, I guess you need to do some partitioning).

The good side of using numpy is that it is usually much faster than 'standard' Python operations; the bad side is that it is an extra dependency.

import numpy
raw_array = [  # This is for testing the code; with the db, use .values_list('id', 'val1', 'val2')
    [1, 5, 6],
    [2, 6, 4],
    [3, 3, 1],
    [4, 8, 4],
    [5, 2, 6],
    [6, 8, 2],
    [7, 1, 1],
]

arr = numpy.array(raw_array)

def sum_up_every_n_items(arr, n):
    res = numpy.zeros((arr.shape[0] // n, arr.shape[1]))  # one output row per full group of n
    arr = arr[0:res.shape[0] * n, :]  # Truncate, take only full N items
    for loop in range(0, n):  # Note: this is loop 0,1,2 if n=3! We do the addition with numpy vectors!
        res = res + arr[loop::n, :]  # Take every n'th row, starting at offset `loop`
    res = res / float(n)
    return res

res = sum_up_every_n_items(arr, n=3)
print res

outputs

[[ 2.          4.66666667  3.66666667]
 [ 5.          6.          4.        ]]
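As a usage note, the same per-n averaging can also be written without the explicit Python loop by reshaping the array; a minimal sketch, assuming the data is pulled from Django via values_list (the model name MyModel is an assumption):

import numpy

n = 3
# Assumption: MyModel is the model behind the table in the question.
arr = numpy.array(list(MyModel.objects.order_by('id').values_list('id', 'val1', 'val2')))
full = (arr.shape[0] // n) * n  # keep only complete groups of n rows
means = arr[:full].reshape(-1, n, arr.shape[1]).mean(axis=1)
# one row of column means (id, val1, val2) per group of n records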
susundberg
  • I don't think this is a good idea, because if you really have huge tables you will be consuming a lot of RAM!! – e-nouri Oct 22 '14 at 10:00
  • You get the best performance by consuming RAM. As I wrote, if you have HUGE datasets you need to partition the calculation (do the .values_list from a filtered queryset). – susundberg Oct 22 '14 at 10:21
  • With HUGE datasets the best is to optimize your algorithm and do batch processing; sure, it will take a bit more time, but at least you won't run out of RAM. – e-nouri Oct 22 '14 at 10:33
  • I am not quite following you. Just put a single for loop on top of that .values_list with, say, 2**30 elements and you surely won't run out of RAM, AND it's optimized in the sense that it will use native code for the calculation. – susundberg Oct 22 '14 at 10:43

Avoid doing many queries. Pull the data down one time and do the rest in Python:

n = 3  # your step
results = list(Model.objects.filter(query).only('id', 'val1', 'val2'))  # evaluate the queryset once (avoids an extra count query), then work in memory
avgs = {}
for i in xrange(0, len(results), n):
    group = results[i:i + n]  # slice, so the last (possibly shorter) group still works
    avg1 = sum(r.val1 for r in group) / float(len(group))
    avg2 = sum(r.val2 for r in group) / float(len(group))
    avgs[results[i].id] = (avg1, avg2)  # Why is id relevant at all here?
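With the sample data from the question and n = 3, avgs would come out as roughly {1: (4.67, 3.67), 4: (6.0, 4.0)}, keyed by the first id of each group.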
Martol1ni