
I'm working with some very large arrays. One issue, of course, is running out of RAM, but even before that my code runs so slowly that, even with infinite RAM, it would still take far too long. I'll give a bit of my code to show what I'm trying to do:

import numpy as np

# samplez is a 3 million element 1-D array
# zfit is a 10,000 x 500 2-D array

b = np.arange(len(zfit))

for x in samplez:
    a = x - zfit
    mask = np.ma.masked_array(a)
    mask[a <= 0] = np.ma.masked
    index = mask.argmin(axis=1)
    # These past 4 lines give me an index array of the smallest positive number
    # in x - zfit

    d = zfit[b, index]
    e = zfit[b, index + 1]
    f = (x - d)/(e - d)
    # f is the calculation I am after

    if x == samplez[0]:
        g = f
        index_stack = index
    else:
        g = np.vstack((g, f))
        index_stack = np.vstack((index_stack, index))

I need to use g and index_stack, each of which is a 3,000,000 x 10,000 2-D array, in a further calculation. Each iteration of this loop takes almost 1 second, so the whole loop would take about 3 million seconds (roughly 35 days), which is way too long.

Is there anything I can do to make this calculation run much faster? I've tried to think of a way to do without the for loop, but the only one I can imagine is making 3 million copies of zfit, which is infeasible.

And is there some way I can work with these arrays without keeping everything in RAM? I'm a beginner, and everything I've found while searching is either irrelevant or something I can't understand. Thanks in advance.

  • Is there any duplicated value in samplez, or does it contain only unique values? – CT Zhu Sep 29 '13 at 02:55
  • They are all unique and in increasing order – cracka31 Sep 29 '13 at 03:49
  • There is a potential problem in `e = zfit[b,index+1]`. If the smallest positive value is the last element in any row of the array, `[b,index+1]` will cause an `IndexError` (out of range). Also, the first line should be `b = np.arange(len(zfit))` – CT Zhu Sep 29 '13 at 05:30
  • Thanks for the comment. For reasons not very relevant to the problem, the smallest positive number will never be the last element of any row in a, so the IndexError is not an issue, though you're right that in general it would be a consideration. The first line was a typo, thanks. – cracka31 Sep 29 '13 at 06:32
  • And the maximum value in each row of `zfit` is larger than the maximum value in `samplez`? – CT Zhu Sep 29 '13 at 07:16
  • Yes, the max value of zfit in each row is always greater than the max of samplez – cracka31 Sep 29 '13 at 15:55

1 Answer


It is good to know that the smallest positive number will never show up at the end of a row.

samplez contains 3 million unique values, while each row of zfit can have at most 500 unique values, so the entire zfit holds at most 5 million unique values. The algorithm can be sped up greatly if the number of 'find the smallest positive x - zfit entry' searches can be reduced. Doing all ~1.5e13 comparisons is probably overkill, and careful planning could eliminate a large proportion of them, but that will depend heavily on your actual underlying mathematics.

Without knowing that, there are still a couple of small things that can be done: (1) the differences (e-d) don't depend on x, so they can all be computed once outside the loop; (2) the explicit for loop can be replaced by map. These two small fixes give about a 22% speed-up on my machine:

def function_map(samplez, zfit):
    rows = np.arange(len(zfit))
    diff = zfit[:, 1:] - zfit[:, :-1]   # all possible (e - d), computed once
    def _fuc1(x):
        a = x - zfit
        mask = np.ma.masked_array(a)
        mask[a <= 0] = np.ma.masked
        index = mask.argmin(axis=1)
        d = zfit[rows, index]
        f = (x - d)/diff[rows, index]   # constraint: smallest value never at the very end
        return (index, f)
    result = list(map(_fuc1, samplez))  # list() so it also works in Python 3
    return (np.array([item[1] for item in result]),
            np.array([item[0] for item in result]))

Next: masked_array can be avoided completely, which should bring a significant improvement. For this, samplez needs to be sorted as well:

>>> import numpy as np
>>> x1 = np.arange(50)
>>> x2 = np.random.random(size=(20, 10))*120
>>> x2 = np.sort(x2, axis=1)  # just to make sure the last element of each row > largest val in x1
>>> x3 = x2*1
>>> f1 = lambda: function_map2(x1, x3)
>>> f0 = lambda: function_map(x1, x2)
>>> def function_map2(samplez, zfit):
    rows = np.arange(len(zfit))
    _diff = np.diff(zfit, axis=1)   # all possible (e - d), computed once
    _zfit = zfit.copy()
    def _fuc1(x):
        # samplez is sorted, so the in-place masking never needs to be undone
        _zfit[_zfit < x] = np.inf
        # the smallest unmasked value is the first one >= x; the entry just
        # before it gives the smallest positive x - zfit
        index = np.argmin(_zfit, axis=1) - 1
        d = zfit[rows, index]
        f = (x - d)/_diff[rows, index]  # constraint: smallest value never at the very end
        return (index, f)
    result = list(map(_fuc1, samplez))
    return (np.array([item[1] for item in result]),
            np.array([item[0] for item in result]))

>>> import timeit
>>> t1=timeit.Timer('f1()', 'from __main__ import f1')
>>> t0=timeit.Timer('f0()', 'from __main__ import f0')
>>> t0.timeit(5)
0.09083795547485352
>>> t1.timeit(5)
0.05301499366760254
>>> t0.timeit(50)
0.8838210105895996
>>> t1.timeit(50)
0.5063929557800293
>>> t0.timeit(500)
8.900799036026001
>>> t1.timeit(500)
4.614129018783569

So, that is another 50% speed-up.

masked_array is avoided, and that saves some RAM. I can't think of much else to reduce RAM usage; it may be necessary to process samplez in parts, as in the sketch below. Also, depending on the data and the required precision, using float16 or float32 instead of the default float64 can save you a lot of RAM.
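As a minimal sketch of that chunked idea, assuming disk-backed outputs via np.memmap are acceptable (the file names, chunk size, and dtypes here are illustrative assumptions, not benchmarked choices):

import numpy as np

n_samples, n_rows = len(samplez), len(zfit)
chunk = 1000  # hypothetical chunk size; tune to the RAM actually available

# disk-backed outputs, so g and index_stack never sit in RAM all at once
g_out = np.memmap('g.dat', dtype=np.float32, mode='w+',
                  shape=(n_samples, n_rows))
index_out = np.memmap('index_stack.dat', dtype=np.int16, mode='w+',
                      shape=(n_samples, n_rows))

for start in range(0, n_samples, chunk):
    stop = min(start + chunk, n_samples)
    # function_map2 copies zfit internally, so each chunk starts clean;
    # slicing the sorted samplez keeps each slice sorted, as required
    f_part, i_part = function_map2(samplez[start:stop], zfit)
    g_out[start:stop] = f_part
    index_out[start:stop] = i_part

g_out.flush()
index_out.flush()

int16 is enough for the index since a row has only 500 columns, and float32 follows the precision suggestion above; together they cut the roughly 240 GB that two float64/int64 outputs would need by well over half.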

  • Note that in Python 3 `map` returns an iterator, hence the `return` statement will fail: the second array will always be empty. Also, please avoid using `\` as a continuation; simply wrap the return value in parentheses. – Bakuriu Sep 29 '13 at 07:28
  • Hi CT. When you run this on your machine are you using the same size arrays as I am? Your machine can handle this? – cracka31 Sep 29 '13 at 16:21
  • No, a much smaller size, just to benchmark the speed. 3e10 `float64` or `int64` values take more than 200 GB: http://www.wolframalpha.com/input/?i=3e10+*+8+bytes. I don't have that capacity. – CT Zhu Sep 29 '13 at 16:48
  • And you say the underlying mathematics might allow for more simplifications: samplez contains unique values from .08 to 1.1, and each row of zfit contains unique values from 0 to 2 in increasing order. For each element in samplez, I'm trying to find the largest number in each row of zfit that is still below it, and then the fraction of the distance between that number and the next element in zfit (see the sketch after these comments). So if the element in samplez were .5 and a row of zfit were .3, .4, .6, .7, I'd need the fraction (.5-.4)/(.6-.4). Hope that's not too confusing – cracka31 Sep 29 '13 at 16:53
  • Yes, I know that this many elements take up around 240 GB... that's why I'm hoping someone will also suggest a memory solution :) – cracka31 Sep 29 '13 at 16:59
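Since the comments confirm that samplez and every row of zfit are sorted, increasing, and unique, the masked argmin can be replaced by a binary search. A minimal sketch of that idea, assuming every value of samplez falls strictly inside each row's range, as the comments state (fraction_sorted is a hypothetical helper, not code from the answer above):

import numpy as np

def fraction_sorted(samplez, zfit):
    # index[s, r] = position of the largest zfit[r] value below samplez[s];
    # side='left' returns the insertion point, i.e. one past that position
    index = np.empty((len(samplez), len(zfit)), dtype=np.int16)
    for r in range(len(zfit)):  # 10,000 cheap binary searches
        index[:, r] = np.searchsorted(zfit[r], samplez, side='left') - 1
    rows = np.arange(len(zfit))
    d = zfit[rows, index]       # value just below each x, per row
    e = zfit[rows, index + 1]   # next value up
    f = (samplez[:, None] - d)/(e - d)
    return f, index

For the example in the comment, np.searchsorted([.3, .4, .6, .7], .5, side='left') returns 2, so index is 1, d = .4, e = .6, and f = (.5-.4)/(.6-.4), exactly the fraction described. For the full-size arrays this still produces 3,000,000 x 10,000 outputs, so it would need to be combined with the chunked, disk-backed writes sketched earlier.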