4

I have two 2D arrays of equal shape: given_array and reference_array. For each unique value of reference_array, I have to write to a file the mean of the values of given_array at the positions where reference_array equals that unique value.

import numpy as np

given_array = np.array([[2,4,5,8,9,11,15],[1,2,3,4,5,6,7]])

reference_array = np.array([[2,2,2,8,8,8,15],[2,2,2,4,8,8,9]])

unique_value = np.unique(reference_array)

file_out = open('file_out', 'w')

for unique in unique_value:
    index = reference_array == unique
    mean = np.mean(given_array[index])
    file_out.write(str(unique) + ',' + str(mean) + '\n')

file_out.close()

The above code works, but in my real problem the two arrays are extremely large (they are read from raster images), and the processing takes a few days to complete.

I would be grateful if someone could provide the fastest way of producing the same result.

Borys

2 Answers

5

Going through the arrays only once might be faster, even though it uses pure Python:

from collections import defaultdict
from itertools import izip

# Accumulate a running (sum, count) pair for every reference value.
add = lambda (sum_, count), value: (sum_+value, count+1)
unique = defaultdict(lambda: (0, 0))
for ref, value in izip(reference_array.flat, given_array.flat):
    unique[ref] = add(unique[ref], float(value))  # float avoids integer overflow

# Write one "reference,mean" line per unique reference value.
with open('file.out', 'w') as out:
    for ref, (sum_, count) in unique.iteritems():
        out.write('%f,%f\n' % (ref, sum_ / count))

In contrast to the OP's solution, finding the unique values and calculating the mean values is done in one loop. unique is a dictionary where each key is a reference value and the corresponding value is a pair holding the sum and count of all given values that share that reference value. After the loop, not only are all unique reference values in the dictionary, but every element of given_array has been accumulated into the sum and count of its reference value, from which the mean is easily calculated in a second step.

The complexity of the problem was reduced from size_of_array * number_of_unique_values to size_of_array + number_of_unique_values.
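
Note that izip, iteritems and tuple-unpacking lambdas exist only in Python 2; on Python 3, a minimal sketch of the same single-pass approach (using the question's example arrays) could look like this:

from collections import defaultdict

import numpy as np

given_array = np.array([[2, 4, 5, 8, 9, 11, 15], [1, 2, 3, 4, 5, 6, 7]])
reference_array = np.array([[2, 2, 2, 8, 8, 8, 15], [2, 2, 2, 4, 8, 8, 9]])

# One pass over both arrays: accumulate (sum, count) per reference value.
sums_counts = defaultdict(lambda: (0.0, 0))
for ref, value in zip(reference_array.flat, given_array.flat):
    s, c = sums_counts[ref]
    sums_counts[ref] = (s + float(value), c + 1)

# Second step: write one line per unique reference value with its mean.
with open('file.out', 'w') as out:
    for ref, (s, c) in sorted(sums_counts.items()):
        out.write('%f,%f\n' % (ref, s / c))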

Daniel
  • Thanks, upvoted, will test soon. But isn't it necessary to use numpy for faster performance? – Borys Nov 23 '14 at 10:09
  • Sorry, there was a writing mistake in my question; the correct line is index = reference_array == unique. Does your code still work? – Borys Nov 23 '14 at 11:01
  • This was an obvious mistake; I don't even need unique for my solution. – Daniel Nov 23 '14 at 11:10
  • Reading large lists into memory melts down the computer; this code must be changed so that the results are written to disk as soon as they are computed. – Borys Nov 23 '14 at 12:05
  • how large are your arrays, and how many unique values are there? – Daniel Nov 23 '14 at 13:22
  • So am I correct that your `given_array` has a size of 150GB and reference_array of more than 1TB? If you can get these arrays into memory, you should have no problem with memory in the second example; the `unique` table uses only a few hundred megabytes. Btw, you could also read only portions of the arrays into memory. – Daniel Nov 23 '14 at 15:56
  • Nope, I tried with the second array and the PC stops working. I'm sorry, the sizes of both arrays are not so large; they are 40000, 40000. – Borys Nov 23 '14 at 16:00
  • Correction: both arrays have a size of 40000 x 40000. The unique values of the reference array are 1 to 1600000. The given array is in bytes (0 to 255). – Borys Nov 23 '14 at 16:02
  • There is no problem reading in the two arrays, but later, when computing the mean per unique value, the computer really hangs. Isn't it possible to write the results to disk one by one, immediately after each is computed? – Borys Nov 23 '14 at 16:06
  • With your sizes I've done a timing test, and it says it needs less than two hours. The advantage of my algorithm is that it needs only **one** pass through the data to get all the mean values, not up to 1600000 passes. – Daniel Nov 23 '14 at 16:22
  • Whoops, I used zip rather than izip before, and the PC hung. Now corrected and running smoothly. Let's see the result. I have accepted your answer, and would be happy if you could explain your code a little bit, the second one. – Borys Nov 23 '14 at 16:32
  • Yes, I checked the results, and it works correctly. Would be grateful if you explained how it works. – Borys Nov 23 '14 at 16:41
  • Upvoted for the improvement of the code, but it's very difficult for a newbie to understand. Could you please elaborate a bit more, especially on add and the lambda? – Roman Nov 23 '14 at 17:22
  • I got RuntimeWarning: overflow encountered in long_scalars at the line add = lambda (a,b), c: (a+c, b+1). Is it important? – Borys Nov 23 '14 at 17:41
  • @simen: oh, yes, that's bad. You're using `numpy.int`s which can overflow. I've now converted the values to float to circumvent the problem. – Daniel Nov 23 '14 at 17:56
5

You can get the whole thing to work in numpy using unique and bincount. Since numpy's unique uses sorting, it has linearithmic complexity, but it typically beats pure Python code using dictionaries, despite their linear complexity.

If you are using numpy 1.9 or newer:

>>> unq, inv, cnts = np.unique(reference_array, return_inverse=True,
...                            return_counts=True)
>>> means = np.bincount(inv, weights=given_array.ravel()) / cnts

>>> unq
array([ 2,  4,  8,  9, 15])
>>> means
array([  2.83333333,   4.        ,   7.8       ,   7.        ,  15.        ])

With older numpy versions it would be marginally slower, but you would do something like:

>>> unq, inv = np.unique(reference_array, return_inverse=True)
>>> cnts = np.bincount(inv)
>>> means = np.bincount(inv, weights=given_array.ravel()) / cnts
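
In either case unq and means line up element for element, so the comma-separated file from the question can then be written with, for example, np.savetxt (a minimal sketch):

>>> np.savetxt('file_out', np.column_stack((unq, means)),
...            fmt='%g', delimiter=',')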

EDIT

For more elaborate operations you will need to replicate the guts of np.unique. First, sort both flattened arrays based on the contents of reference_array:

>>> sort_idx = np.argsort(reference_array, axis=None)
>>> given_sort = given_array.ravel()[sort_idx]
>>> ref_sort = reference_array.ravel()[sort_idx]

Then count the number of items in each group:

>>> first_mask = np.concatenate(([True], ref_sort[:-1] != ref_sort[1:]))
>>> first_idx, = np.nonzero(first_mask)
>>> cnts = np.diff(np.concatenate((first_idx, [ref_sort.size])))
>>> cnts
array([6, 1, 5, 1, 1])
>>> unq = ref_sort[first_mask]
>>> unq
array([ 2,  4,  8,  9, 15])

Finally, compute your group calculation, using ufuncs and their reduceat method, e.g. for the group max:

>>> np.maximum.reduceat(given_sort, first_idx)
array([ 5,  4, 11,  7, 15])
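
As a quick sketch, the group means from the beginning of the answer can be recovered from the same sorted layout with np.add.reduceat:

>>> np.add.reduceat(given_sort, first_idx) / cnts.astype(float)
array([  2.83333333,   4.        ,   7.8       ,   7.        ,  15.        ])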
Jaime
  • Superb, this is exactly what I was looking for. I would be even happier if you could provide a more general way of doing it, so that I could use it not only for the mean but also for the standard deviation (np.mean, np.std). Is it possible without compromising most of the performance? – Borys Nov 24 '14 at 02:10
  • I am using numpy 1.9. – Borys Nov 24 '14 at 02:16
  • It is not a very numerically stable approach, but you can compute the standard deviation from the sum of the values in each group, `sx = np.bincount(inv, weights=given_array.ravel())` and the sum of their squares, `sx2 = np.bincount(inv, weights=given_array.ravel()**2)`, see e.g. [this answer to a similar question](http://stackoverflow.com/questions/15477857/mean-values-depending-on-binning-with-respect-to-second-variable/15478137#15478137) for details. – Jaime Nov 24 '14 at 03:40
  • I am looking for a more general way so that I could use np.max() as well. – Borys Nov 24 '14 at 05:25
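
Following up on the standard-deviation comment above, a rough sketch reusing inv and cnts from the first snippet (not numerically stable, as noted) would be:

>>> sx = np.bincount(inv, weights=given_array.ravel())
>>> sx2 = np.bincount(inv, weights=given_array.ravel() ** 2)
>>> np.sqrt(sx2 / cnts - (sx / cnts) ** 2)
array([ 1.34370962,  0.        ,  2.13541565,  0.        ,  0.        ])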