
I am baffled by this:

def main():
    for i in xrange(2560000):
        a = [0.0, 0.0, 0.0]

main()

$ time python test.py

real     0m0.793s

Let's now see with numpy:

import numpy

def main():
    for i in xrange(2560000):
        a = numpy.array([0.0, 0.0, 0.0])

main()

$ time python test.py

real    0m39.338s

Holy CPU cycles batman!

Using numpy.zeros(3) improves things, but it's still not enough IMHO.
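For completeness, the variant measured here just swaps the constructor:

import numpy

def main():
    for i in xrange(2560000):
        a = numpy.zeros(3)

main()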

$ time python test.py

real    0m5.610s
user    0m5.449s
sys 0m0.070s

numpy.version.version = '1.5.1'

If you are wondering whether the list creation is optimized away in the first example, it is not:

  5          19 LOAD_CONST               2 (0.0)
             22 LOAD_CONST               2 (0.0)
             25 LOAD_CONST               2 (0.0)
             28 BUILD_LIST               3
             31 STORE_FAST               1 (a)
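That disassembly comes from the standard dis module; for reference, it can be reproduced with:

import dis

def main():
    for i in xrange(2560000):
        a = [0.0, 0.0, 0.0]

dis.dis(main)  # prints the bytecode above, including BUILD_LIST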
Stefano Borini
  • A quick thought: `numpy.array` is actually a more complex data structure than a list. And in the second snippet, you create a list **and** a numpy array (in the first, only a list). Whether this is the only reason for such a big difference, I cannot say. – Felix Kling Jul 02 '11 at 20:38
  • @Felix: ok, but the creation of the list is fast, so even if I create a list and a numpy array in the second case, it's still the numpy creation that is the hot spot here; and regardless of how complex the structure may be, it's still damn expensive... – Stefano Borini Jul 02 '11 at 20:40
  • But consider: creating the data is rarely the bottleneck in an application so complex that it uses numpy. I don't know what happens under the hood either, but it obviously makes math-heavy programs faster at the end of the day, so there's no reason to complain ;) –  Jul 02 '11 at 20:49
  • @Stefano: aren't you including the import of numpy in the timings? (Also, Python has a builtin timeit module.) – Katriel Jul 02 '11 at 21:09
  • Just a quick tip: you can use `python -m timeit 'statement'` (with `-s` for setup code) to do benchmarking; see the sketch after these comments. – igorgue Jul 02 '11 at 21:21
  • Does `numpy` have a mechanism to reuse unused arrays? Because Python lists have one. Note also that `numpy.array` needs a lookup in the `numpy` module object for the `array` attribute, while the `[]` constructor performs no lookups, even if this is not really a performance penalty. – mg. Jul 02 '11 at 21:24
  • @Stefano That bytecode is fairly optimised. BUILD_LIST builds a list directly from the stack, as opposed to creating it via a series of appends. Note also the use of LOAD_CONST. Meaning not only does `a[0] == a[1]` evaluate to True, but so does `a[0] is a[1]` -- only one float object is ever created, as opposed to three. – Dunes Jul 02 '11 at 21:25
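As the comments suggest, the builtin timeit module isolates the per-statement cost and keeps the numpy import out of the measurement; a minimal sketch (outputs omitted, since timings vary by machine):

$ python -m timeit 'a = [0.0, 0.0, 0.0]'
$ python -m timeit -s 'import numpy' 'a = numpy.array([0.0, 0.0, 0.0])'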

4 Answers


Numpy is optimised for large amounts of data. Give it a tiny length-3 array and, unsurprisingly, it performs poorly.

Consider a separate test:

import timeit

reps = 100

pythonTest = timeit.Timer('a = [0.] * 1000000')
numpyTest = timeit.Timer('a = numpy.zeros(1000000)', setup='import numpy')
uninitialised = timeit.Timer('a = numpy.empty(1000000)', setup='import numpy')
# empty simply allocates the memory. Thus the initial contents of the array
# are random noise

print 'python list:', pythonTest.timeit(reps), 'seconds'
print 'numpy array:', numpyTest.timeit(reps), 'seconds'
print 'uninitialised array:', uninitialised.timeit(reps), 'seconds'

And the output is:

python list: 1.22042918205 seconds
numpy array: 1.05412316322 seconds
uninitialised array: 0.0016028881073 seconds

It would seem that it is the zeroing of the array that is taking all the time for numpy. So unless you need the array to be initialised, try using empty.
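The same harness can also be pointed at the question's size of 3, where per-call overhead rather than zeroing dominates; a minimal sketch (timings omitted, as they vary by machine):

import timeit

reps = 2560000  # same loop count as in the question

smallList = timeit.Timer('a = [0.0, 0.0, 0.0]')
smallArray = timeit.Timer('a = numpy.array([0.0, 0.0, 0.0])', setup='import numpy')
smallZeros = timeit.Timer('a = numpy.zeros(3)', setup='import numpy')
smallEmpty = timeit.Timer('a = numpy.empty(3)', setup='import numpy')

print 'list literal:', smallList.timeit(reps), 'seconds'
print 'numpy.array: ', smallArray.timeit(reps), 'seconds'
print 'numpy.zeros: ', smallZeros.timeit(reps), 'seconds'
print 'numpy.empty: ', smallEmpty.timeit(reps), 'seconds'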

Dunes
  • For fairness, you should have done `pythonTest = timeit.Timer('a = [0.] * 1000000')`; it still performs slower than numpy, but it's quite a bit faster than a list comprehension. And it is "closer" to a list literal (as given in the question) in that it doesn't run a Python loop. – Rosh Oxymoron Jul 02 '11 at 21:43
  • @Rosh Good point. I think I've always shied away from the `*` operator for lists as it puts the same object in each index. Though since numbers are immutable, that doesn't matter in this case. Try performing a mass operation on the list/array, though, and numpy is way out in the lead again (e.g. `arr += 1`). – Dunes Jul 02 '11 at 21:56
  • Very good point, thank you. Considering the result, what would you suggest for small arrays? I mean, lists and tuples are not really nice when it comes to basic array operations (such as vector-vector products, multiplication of an array by a number, determinants of small matrices). Of course I can reimplement the algorithms myself, that's not the big problem here, but if there's already something for that, I consider it the preferred solution. – Stefano Borini Jul 02 '11 at 22:02
  • Separate question? But anyway, the itertools docs suggest that you can make very efficient vector functions with a combination of itertools and operator functions; see the sketch after these comments. – Dunes Jul 02 '11 at 22:17
  • @Stefano Borini: Simply, do not try to optimize with small `numpy` arrays. Instead, try to consolidate the operations into much bigger chunks. Anyway, it seems that your rant is based only on the creation of small arrays. Please describe your real problem, in order to figure out whether it's more suitable to solve in the 'pure python' or 'numpy' realm. Thanks – eat Jul 02 '11 at 22:22
  • @eat: rant? I'm not ranting. Until a few minutes ago, my main feeling was a big question mark bouncing on my head. I can (and probably will) consolidate data into larger arrays, but it's too soon for that now. I'm writing a raytracer and have to perform some 3D operations, but I don't want to go heavy on optimization immediately, because I am also writing a tutorial with it. – Stefano Borini Jul 02 '11 at 22:26
  • @Stefano Borini: Apologies, but it felt like a rant to me. Anyway, I think `numpy` will play very nicely with you if you are able to utilize linear algebra in the proper manner. Thanks – eat Jul 02 '11 at 22:43
  • @eat: if I were ranting, I'd have said "why does this crappy library take so much time to create a damn array?". That would be the wording of a rant. Mine is curiosity: I see an event, I post a question, I get good answers, as normally happens. – Stefano Borini Jul 03 '11 at 11:32
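A minimal sketch of the itertools/operator approach mentioned in the comments above, here as a pure-Python dot product (imap is Python 2's lazy map):

from itertools import imap
from operator import mul

def dot(u, v):
    # multiply elementwise (lazily), then sum -- no numpy arrays created
    return sum(imap(mul, u, v))

print dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 32.0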

"Holy CPU cycles batman!", indeed.

But please consider instead something very fundamental about numpy: its sophisticated, linear-algebra-based functionality (such as random number generation or singular value decomposition). Now, consider these seemingly simple calculations:

In []: A= rand(2560000, 3)
In []: %timeit rand(2560000, 3)
1 loops, best of 3: 296 ms per loop
In []: %timeit u, s, v= svd(A, full_matrices= False)
1 loops, best of 3: 571 ms per loop

and please trust me that this kind of performance will not be beaten significantly by any package currently available.

So, please describe your real problem, and I'll try to figure out a decent numpy-based solution for it.

Update:
Here is some simple code for ray-sphere intersection:

import numpy as np

def mag(X):
    # column-wise Euclidean magnitude
    return (X**2).sum(0)**0.5

def closest(R, c):
    # closest point on each ray to the center, and its distance to the center
    P = np.dot(c.T, R)*R  # project the center onto each unit-length ray
    return P, mag(P - c)

def intersect(R, P, h, r):
    # intersection of rays and sphere; with h = r - b, Pythagoras gives
    # sqrt(h*(2*r - h)) = sqrt(r**2 - b**2), the backtrack along the ray
    return P - (h*(2*r - h))**0.5*R

# set up
c, r = np.array([10, 10, 10])[:, None], 2.  # center (as a column), radius
n = 500000
R = np.random.rand(3, n)  # some random rays in the first octant
R = R/mag(R)              # normalized to unit length

# find rays which will intersect the sphere
P, b = closest(R, c)
wi = b <= r

# and for those which will, find the intersection
X = intersect(R[:, wi], P[:, wi], r - b[wi], r)

Apparently we calculated correctly:

In []: allclose(mag(X- c), r)
Out[]: True

And some timings:

In []: %timeit P, b = closest(R, c)
10 loops, best of 3: 93.4 ms per loop
In []: n / 0.0934
Out[]: 5353319 #=> more than 5 million detections of possible intersections/s
In []: %timeit X = intersect(R[:, wi], P[:, wi], r - b[wi], r)
10 loops, best of 3: 32.7 ms per loop
In []: X.shape[1] / 0.0327
Out[]: 874037 #=> almost 1 million actual intersections/s

These timings were done on a very modest machine; with a modern machine, a significant speed-up can still be expected.

Anyway, this is only a short demonstration of how to code with numpy.

eat
  • my real problem: http://stackoverflow.com/questions/6528214/improving-performance-of-raytracing-hit-function – Stefano Borini Jul 02 '11 at 23:13
  • Good. However, it does not really allow you to deal with Sphere objects directly this way. You must have a backend that converts the high-level design into an aggregated set of coordinates that are then fed to numpy. – Stefano Borini Jul 03 '11 at 10:55
  • +1 for "please rather consider something very fundamental related to numpy" – doug Jul 03 '11 at 11:05
  • @Stefano Borini: Well, I still don't know what you really are trying to do, but in order to utilize `numpy` in an efficient manner you should process in reasonably big 'chunks'. Why not keep your OO design, but avoid storing coordinates individually in objects? It's straightforward to have a mapping between objects and columns (or rows); see the sketch after these comments. Please also note how easy and IMHO readable (close to the higher-level math involved) the code you'll be able to produce with `numpy` is. Thanks – eat Jul 03 '11 at 11:27
  • @eat: I want to have individual geometric objects such as sphere, plane and so on, and I would like these objects to know their own information, such as geometry, and also be able to tell if they are intersected or not. Most of the operations I perform happen via these coords, which means that every time I do something with a coordinate, a numpy array is most likely created (for temps, for final results such as intersection point, and so on). What you propose does not consider that I may have a single sphere, which is just a 3-array, so I will never have a huge array to perform operations on. – Stefano Borini Jul 03 '11 at 11:36
  • @Stefano Borini: FWIW, at least you seem to have plenty of rays. I would still recommend keeping all your 'permanent' points in an array and writing code that lets `numpy` handle the temporaries, i.e. minimizing the need to create small `numpy` arrays. Good luck! Thanks – eat Jul 03 '11 at 11:54
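A minimal sketch of the layout suggested in the comments above; the SphereSet class and the column-per-sphere convention are illustrative assumptions, not part of any library:

import numpy as np

class SphereSet(object):
    # all centers live in one (3, n) array; an individual sphere is
    # just a column index plus a radius, so no small arrays are created
    def __init__(self, centers, radii):
        self.centers = np.asarray(centers, dtype=float)  # shape (3, n)
        self.radii = np.asarray(radii, dtype=float)      # shape (n,)

    def center(self, i):
        return self.centers[:, i]  # a view into the big array, not a copy

# three spheres, coordinates stored column-wise
spheres = SphereSet([[0., 1., 2.],
                     [0., 1., 2.],
                     [0., 1., 2.]],
                    [1., 2., 0.5])
print spheres.center(1)  # -> [ 1.  1.  1.]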

Late answer, but could be important for other viewers.

This problem has been considered in the kwant project as well. Indeed, small arrays are not optimized in numpy, and quite frequently small arrays are exactly what you need.

In this regard, they created a substitute for small arrays which behaves like, and co-exists with, numpy arrays (any operation not implemented in the new data type is delegated to numpy).

You should look into this project:
https://pypi.python.org/pypi/tinyarray/1.0.5
whose main purpose is to behave nicely for small arrays. Of course, some of the fancier things you can do with numpy are not supported by it. But numerics seems to be what you need.
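For illustration, a minimal sketch of the intended interplay (based on the project description; treat the exact tinyarray API surface as an assumption):

import numpy
import tinyarray

a = tinyarray.array([1.0, 2.0, 3.0])
b = tinyarray.array([4.0, 5.0, 6.0])

print a + b                # elementwise arithmetic is implemented natively
print tinyarray.dot(a, b)  # so are small dot products

# anything tinyarray does not implement can be handed over to numpy
print numpy.cross(a, b)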

I have made some small tests:

python

I have added the numpy import here too, so that the module load time is the same across all tests.

import numpy

def main():
    for i in xrange(2560000):
        a = [0.0, 0.0, 0.0]

main()

numpy

import numpy

def main():
    for i in xrange(2560000):
        a = numpy.array([0.0, 0.0, 0.0])

main()

numpy-zero

import numpy

def main():
    for i in xrange(2560000):
        a = numpy.zeros((3,1))

main()

tinyarray

import numpy,tinyarray

def main():
    for i in xrange(2560000):
        a = tinyarray.array([0.0, 0.0, 0.0])

main()

tinyarray-zero

import numpy,tinyarray

def main():
    for i in xrange(2560000):
        a = tinyarray.zeros((3,1))

main()

I ran this:

for f in python numpy numpy_zero tiny tiny_zero ; do 
   echo $f 
   for i in `seq 5` ; do 
      time python ${f}_test.py
   done 
 done

And got:

python
python ${f}_test.py  0.31s user 0.02s system 99% cpu 0.339 total
python ${f}_test.py  0.29s user 0.03s system 98% cpu 0.328 total
python ${f}_test.py  0.33s user 0.01s system 98% cpu 0.345 total
python ${f}_test.py  0.31s user 0.01s system 98% cpu 0.325 total
python ${f}_test.py  0.32s user 0.00s system 98% cpu 0.326 total
numpy
python ${f}_test.py  2.79s user 0.01s system 99% cpu 2.812 total
python ${f}_test.py  2.80s user 0.02s system 99% cpu 2.832 total
python ${f}_test.py  3.01s user 0.02s system 99% cpu 3.033 total
python ${f}_test.py  2.99s user 0.01s system 99% cpu 3.012 total
python ${f}_test.py  3.20s user 0.01s system 99% cpu 3.221 total
numpy_zero
python ${f}_test.py  1.04s user 0.02s system 99% cpu 1.075 total
python ${f}_test.py  1.08s user 0.02s system 99% cpu 1.106 total
python ${f}_test.py  1.04s user 0.02s system 99% cpu 1.065 total
python ${f}_test.py  1.03s user 0.02s system 99% cpu 1.059 total
python ${f}_test.py  1.05s user 0.01s system 99% cpu 1.064 total
tiny
python ${f}_test.py  0.93s user 0.02s system 99% cpu 0.955 total
python ${f}_test.py  0.98s user 0.01s system 99% cpu 0.993 total
python ${f}_test.py  0.93s user 0.02s system 99% cpu 0.953 total
python ${f}_test.py  0.92s user 0.02s system 99% cpu 0.944 total
python ${f}_test.py  0.96s user 0.01s system 99% cpu 0.978 total
tiny_zero
python ${f}_test.py  0.71s user 0.03s system 99% cpu 0.739 total
python ${f}_test.py  0.68s user 0.02s system 99% cpu 0.711 total
python ${f}_test.py  0.70s user 0.01s system 99% cpu 0.721 total
python ${f}_test.py  0.70s user 0.02s system 99% cpu 0.721 total
python ${f}_test.py  0.67s user 0.01s system 99% cpu 0.687 total

Now these tests are (as already pointed out) not the best tests. However, they still show that tinyarray is better suited for small arrays. Another point is that the most common operations should also be faster with tinyarray, so its benefits go beyond data creation alone.

I have never tried it in a fully-fledged project, but the kwant project is using it.

nickpapior
  • on a side note, if some `numpy` function is creating too much overhead, it can sometimes help to bind it to a name once instead of looking it up in the module each time, i.e. `d = numpy.array; a = d([0., 0., 0.])`. – nickpapior Apr 28 '14 at 07:12
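A sketch of how one might measure that lookup saving with timeit; the difference is usually small next to the construction cost itself:

import timeit

moduleLookup = timeit.Timer('a = numpy.array([0.0, 0.0, 0.0])',
                            setup='import numpy')
boundName = timeit.Timer('a = array([0.0, 0.0, 0.0])',
                         setup='from numpy import array')

print 'numpy.array lookup:', moduleLookup.timeit(1000000), 'seconds'
print 'bound name:        ', boundName.timeit(1000000), 'seconds'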

Of course numpy consumes more time in this case, since `a = np.array([0.0, 0.0, 0.0])` is roughly equivalent to `a = [0.0, 0.0, 0.0]` followed by `a = np.array(a)`: it takes two steps. But a numpy array has many good qualities; its high speed shows in operations on arrays, not in their creation. Just some personal thoughts. :)
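A minimal sketch that times the two steps separately (assuming only the builtin timeit module; absolute numbers vary by machine):

import timeit

listOnly = timeit.Timer('a = [0.0, 0.0, 0.0]')
convertOnly = timeit.Timer('a = np.array(lst)',
                           setup='import numpy as np; lst = [0.0, 0.0, 0.0]')
bothSteps = timeit.Timer('a = np.array([0.0, 0.0, 0.0])',
                         setup='import numpy as np')

print 'list only:     ', listOnly.timeit(1000000), 'seconds'
print 'convert only:  ', convertOnly.timeit(1000000), 'seconds'
print 'list + convert:', bothSteps.timeit(1000000), 'seconds'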

ZhengPeng