OK, after some searching I can't find an SO question that directly tackles this. I've looked into masked arrays, and although they seem cool, I'm not sure they are what I need.

Consider two NumPy arrays:

zone_data is a 2-D NumPy array with clumps of elements that share the same value. These are my 'zones'.

value_data is a 2-D NumPy array (same shape as zone_data) with arbitrary values.

I want a NumPy array of the same shape as zone_data/value_data that holds each zone's average value in place of the zone number.

An example, in ASCII art form:

zone_data (4 distinct zones):

1, 1, 2, 2
1, 1, 2, 2
3, 3, 4, 4
3, 3, 4, 4

value_data:

1, 2, 3, 6
3, 0, 2, 5
1, 1, 1, 0
2, 4, 2, 1

my result, call it result_data:

1.5, 1.5, 4.0, 4.0
1.5, 1.5, 4.0, 4.0
2.0, 2.0, 1.0, 1.0
2.0, 2.0, 1.0, 1.0

Here's the code I have. It works fine in that it gives me the correct result.

result_data = np.zeros(zone_data.shape)
for i in np.unique(zone_data):
    result_data[zone_data == i] = np.mean(value_data[zone_data == i])

My arrays are big and my code snippet takes several seconds. I think I have a knowledge gap and haven't hit on anything helpful. The loop aspect needs to be delegated to a library or something...aarg!

I seek help to make this FASTER! Python gods, I seek your wisdom!

EDIT -- adding benchmark script

import numpy as np
import time

zones = np.random.randint(1000, size=(2000,1000))
values = np.random.rand(2000,1000)

print('start method 1:')
start_time = time.time()

result_data = np.zeros(zones.shape)
for i in np.unique(zones):
    result_data[zones == i] = np.mean(values[zones == i])

print('done method 1 in %.2f seconds' % (time.time() - start_time))

print()
print('start method 2:')
start_time = time.time()

# your method here!

print('done method 2 in %.2f seconds' % (time.time() - start_time))

my output:

start method 1:
done method 1 in 4.34 seconds

start method 2:
done method 2 in 0.00 seconds
user1269942

2 Answers

You could use np.bincount:

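# number of cells in each zone (bin index = zone label)
count = np.bincount(zones.flat)
# sum of the values in each zone
tot = np.bincount(zones.flat, weights=values.flat)
avg = tot/count
# look up each cell's zone average
result_data2 = avg[zones]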

which gives me

start method 1:
done method 1 in 3.13 seconds

start method 2:
done method 2 in 0.01 seconds
>>> np.allclose(result_data, result_data2)
True
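
One caveat with the snippet above: np.bincount only accepts non-negative integers, and zone labels with gaps leave empty bins (a 0/0 in tot/count, which is harmless here because those bins are never looked up, but it does raise a warning). A minimal sketch of one way around that, assuming you first compress the labels with np.unique(..., return_inverse=True); the function name zone_means is made up for illustration and is not part of NumPy:

```python
import numpy as np

def zone_means(zones, values):
    """Return an array shaped like `zones` holding each zone's mean value.

    `zone_means` is a hypothetical helper name, not a NumPy function.
    """
    # Map arbitrary labels (gaps, negatives) to compact indices 0..k-1.
    # Raveling first keeps `inverse` 1-D regardless of NumPy version.
    labels, inverse = np.unique(zones.ravel(), return_inverse=True)
    counts = np.bincount(inverse)                          # cells per zone
    totals = np.bincount(inverse, weights=values.ravel())  # sum per zone
    means = totals / counts  # every compact index occurs at least once
    return means[inverse].reshape(zones.shape)

# Small 4x4 example: four 2x2 zones.
zone_data = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]])
value_data = np.array([[1, 2, 3, 6],
                       [3, 0, 2, 5],
                       [1, 1, 1, 0],
                       [2, 4, 2, 1]])
result = zone_means(zone_data, value_data)
print(result)
```

The extra np.unique call sorts the labels, so this is likely somewhat slower than the plain bincount version; it is only worth it when the labels are not small non-negative integers.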
DSM
  • Excellent use of `bincount`. +1 – Oliver W. Jan 17 '15 at 22:37
  • DSM, that's awesome! I love SO mostly because of people like yourself who can share some specific knowledge that would have taken me a long time to find myself. Thank you so much! This was not just a trivial exercise...this will open up one of the bottlenecks I have in an application. Love the "np.allclose" too...what a gem. – user1269942 Jan 19 '15 at 06:25

I thought I had seen this in scipy somewhere, but I can't find it anymore. Have you looked there?

Anyway, you can get a first improvement by changing your loop:

result_data = np.empty(zones.shape)  # minor speed gain
for label in np.unique(zones):
    mask = zones==label
    result_data[mask] = np.mean(values[mask])

That way you don't needlessly do the boolean comparison twice. That'll cut down the execution time a bit.
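
As for the SciPy routine mentioned at the top: it may be scipy.ndimage.mean, which takes a labels array and a list of label values. That is a guess on what was meant, not something confirmed here. A rough sketch, assuming SciPy is installed and using small random stand-ins for the benchmark arrays:

```python
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)
zones = rng.randint(50, size=(200, 100))  # small stand-in for the benchmark arrays
values = rng.rand(200, 100)

labels = np.unique(zones)  # sorted zone labels
# One mean per entry of `labels`, computed over the cells carrying that label.
means = ndimage.mean(values, labels=zones, index=labels)
# searchsorted maps each cell's label to its position in the sorted `labels`.
result_scipy = means[np.searchsorted(labels, zones)]

# Reference: the loop from this answer, with the mask computed once per zone.
result_loop = np.empty(zones.shape)
for label in labels:
    mask = zones == label
    result_loop[mask] = values[mask].mean()

print(np.allclose(result_scipy, result_loop))
```

I have not timed this against the loop on the full-size arrays, so treat it as a pointer to the API rather than a speed claim.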

Oliver W.
  • That's a good observation. In my case it would save about 40%...which is great, and I should have known better...I've done that in many other spots. I will take DSM's answer, however, as it's 100+ times faster! – user1269942 Jan 19 '15 at 06:18