
My expectation was that pypy could be as much as an order of magnitude faster than python, but the results indicate that pypy is in fact slower than expected.

I have two questions:

  1. Why is pypy significantly slower with numpy?
  2. Is there anything I can do to optimize my algorithm to make pypy (or python) faster?

Resulting timings:

Python 2.7.5

  • # points: 16,777,216 (8 ** 3 * 32 ** 3)
  • Xrange time: 1487.15 ms
  • Xrange Numpy time: 2553.98 ms
  • Point Gen Time: 6162.23 ms
  • Numpy Gen Time: 13894.73 ms

Pypy 2.2.1

  • # points: 16,777,216 (8 ** 3 * 32 ** 3)
  • Xrange time: 129.48 ms
  • Xrange Numpy time: 4644.12 ms
  • Point Gen Time: 4643.82 ms
  • Numpy Gen Time: 44168.98 ms

Algorithm:

I am generating a list of points in space using a simple algorithm and I'm trying to optimize the algorithm.

def generate(size=32, point=(0, 0, 0), width=32):
    """
    generate points in space around a center point with a specific width and
      number of divisions (size)
    """
    X, Y, Z = point
    half = width * 0.5
    delta = width
    scale = width / size
    offset = scale * 0.5
    X = X + offset - half
    Y = Y + offset - half
    Z = Z + offset - half
    for x in xrange(size):
        x = (x * scale) + X
        for y in xrange(size):
            y = (y * scale) + Y
            for z in xrange(size):
                z = (z * scale) + Z
                yield (x, y, z)
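
As a quick sanity check, a tiny call yields size ** 3 evenly spaced points centered on point; the values below follow directly from the arithmetic above:

>>> list(generate(size=2, point=(0, 0, 0), width=2))
[(-0.5, -0.5, -0.5), (-0.5, -0.5, 0.5), (-0.5, 0.5, -0.5), (-0.5, 0.5, 0.5),
 (0.5, -0.5, -0.5), (0.5, -0.5, 0.5), (0.5, 0.5, -0.5), (0.5, 0.5, 0.5)]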

In optimizing, I started looking at using pypy rather than python, and in comparing the two I came up with a few different scenarios (each was timed with a simple wall-clock harness, sketched after the list):

  • counting using xrange

    rsize = 8    # size of region
    csize = 32   # size of chunk
    number_of_points = rsize ** 3 * csize ** 3
    [x for x in xrange(number_of_points)]
    
  • counting using xrange with numpy

    rsize = 8    # size of region
    csize = 32   # size of chunk
    number_of_points = rsize ** 3 * csize ** 3
    np.array([x for x in xrange(number_of_points)])
    
  • running with the algorithm above

    rsize = 8    # size of region
    csize = 32   # size of chunk
    [p
     for rp in generate(size=rsize, width=rsize*csize)
     for p in generate(size=csize, width=csize, point=rp)]
    
  • running the algorithm above with numpy

    rsize = 8    # size of region
    csize = 32   # size of chunk
    np.array([p
     for rp in generate(size=rsize, width=rsize*csize)
     for p in generate(size=csize, width=csize, point=rp)])
    
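The exact timing code isn't shown above; the sketch below is roughly what such a harness looks like (Python 2 syntax, labels chosen to match the timings reported earlier):

import time

def timed(label, fn):
    """Run fn once and report elapsed wall-clock time in milliseconds."""
    start = time.time()
    fn()
    print "%s: %.2f ms" % (label, (time.time() - start) * 1000)

rsize, csize = 8, 32
number_of_points = rsize ** 3 * csize ** 3

timed("Xrange time", lambda: [x for x in xrange(number_of_points)])
timed("Point Gen Time", lambda: [p
      for rp in generate(size=rsize, width=rsize * csize)
      for p in generate(size=csize, width=csize, point=rp)])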

Background:

I am attempting to create a voxel engine, and I want to optimize my algorithm to bring down generation times to a manageable level. While I clearly won't achieve anything close to Java/C++, I'd like to push python (or pypy) as far as I can.

From some earlier timings, I noticed that list lookups are significantly faster than dictionary lookups, and that lists are also faster than tuples for reads (unexpected), though tuples generate faster. Numpy has even faster read times than the non-numpy containers. However, numpy creation time can be orders of magnitude slower.

So if reading is most important, there is a clear advantage to using a numpy array. If reading and creation are of equal importance, however, then a straight list is likely best. That said, I don't have a clean way of looking at memory usage, but I suspect that lists are far less memory efficient than tuples or numpy arrays. Also, while it's a small difference, I found .get on a dictionary to be slightly faster than the __getitem__ call (i.e. dictionary[lookup] vs. dictionary.get(lookup)).
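
For a rough sense of memory, sys.getsizeof at least shows the container overhead; note that for a list or tuple it does not count the boxed Python objects the container points to, so it understates their real footprint, whereas .nbytes for a numpy array covers the whole data buffer. A minimal sketch:

import sys
import random
import numpy as np

n = 10 ** 6
values = [random.random() for i in xrange(n)]

print "list container:  %d bytes" % sys.getsizeof(values)
print "tuple container: %d bytes" % sys.getsizeof(tuple(values))
print "numpy array:     %d bytes" % np.array(values).nbytes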

Timings:

Python 2.7.5

Reading

 - Option 1: tuple access... 2045.51 ms
 - Option 2: tuple access (again)... 2081.97 ms    # sampling effect of cache
 - Option 3: list access... 2072.09 ms
 - Option 4: dict access... 3436.53 ms
 - Option 5: iterable creation... N/A
 - Option 6: numpy array... 1752.44 ms

Creation

 - Option 1: tuple creation... 690.36 ms
 - Option 2: tuple creation (again)... 716.49 ms    # sampling effect of cache
 - Option 3: list creation... 684.28 ms
 - Option 4: dict creation... 1498.94 ms
 - Option 5: iterable creation...  0.01 ms
 - Option 6: numpy creation... 3514.25 ms

Pypy 2.2.1

Reading

 - Option 1: tuple access... 243.34 ms
 - Option 2: tuple access (again)... 246.51 ms    # sampling effect of cache
 - Option 3: list access... 139.65 ms
 - Option 4: dict access... 454.65 ms
 - Option 5: iterable creation... N/A
 - Option 6: numpy array... 21.60 ms

Creation

 - Option 1: tuple creation... 1016.27 ms
 - Option 2: tuple creation (again)... 1063.50 ms    # sampling effect of cache
 - Option 3: list creation... 365.98 ms
 - Option 4: dict creation... 2258.44 ms
 - Option 5: iterable creation...  0.00 ms
 - Option 6: numpy creation... 12514.20 ms

In all the examples, random lookups were performed against randomly generated data.

dsize = 10 ** 7   # or 10 million data points
data = [(i, random.random()*dsize)
         for i in range(dsize)]
lookup = tuple(int(random.random()*dsize) for i in range(dsize))

The loops were very simple:

for x in lookup:
    data_of_specific_type[x]

And data_of_specific_type was data converted into the type under test (e.g. tuple(data), list(data), etc.).
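
Putting those pieces together, the reading benchmark for each container looked roughly like the sketch below (a reconstruction from the snippets above, not the exact original script):

import time
import random

dsize = 10 ** 7
data = [(i, random.random() * dsize) for i in range(dsize)]
lookup = tuple(int(random.random() * dsize) for i in range(dsize))

for label, convert in [("tuple", tuple), ("list", list), ("dict", dict)]:
    data_of_specific_type = convert(data)
    start = time.time()
    for x in lookup:
        data_of_specific_type[x]
    print "%s access... %.2f ms" % (label, (time.time() - start) * 1000)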

Brian Bruggeman
  • On http://pypy.org/numpydonate.html it mentions that there are issues with numpy's speed: `To achieve this, we would need to teach our JIT backends how to use modern vector instructions, like SSE or AVM.` Don't quote me, as I don't know for sure, but there's a chance that could be at least part of the reason why Pypy is so slow... Someone else should try confirming this... – Dair May 24 '14 at 01:06

1 Answer


Part of the issue is that this:

np.array([p
    for rp in generate(size=rsize, width=rsize*csize)
    for p in generate(size=csize, width=csize, point=rp)])

builds the entire Python list first and only then converts it to an np.array.

A faster way to do this would be:

import numpy as np

rsize = 8    # size of region
csize = 32   # size of chunk
# preallocate one row per (x, y, z) point instead of building a Python list first
arr = np.empty((rsize ** 3 * csize ** 3, 3))
i = 0
for rp in generate(size=rsize, width=rsize * csize):
    for p in generate(size=csize, width=csize, point=rp):
        arr[i] = p
        i += 1
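
Going a step further, the Python-level loops can be removed entirely with NumPy broadcasting. The sketch below is only an illustration of that idea (generate_np is a hypothetical vectorized counterpart of generate, and whether pypy's numpy supported all of this at the time is a separate question); it keeps the same point ordering as the generator version:

import numpy as np

def generate_np(size=32, point=(0.0, 0.0, 0.0), width=32.0):
    """Vectorized counterpart of generate(): returns a (size ** 3, 3) array."""
    point = np.asarray(point, dtype=float)
    scale = float(width) / size
    # per-axis coordinates: same arithmetic as X/Y/Z + offset - half + k * scale
    axis = np.arange(size) * scale + scale * 0.5 - width * 0.5
    grid = np.empty((size, size, size, 3))
    grid[..., 0] = axis[:, None, None] + point[0]
    grid[..., 1] = axis[None, :, None] + point[1]
    grid[..., 2] = axis[None, None, :] + point[2]
    return grid.reshape(-1, 3)

rsize = 8    # size of region
csize = 32   # size of chunk
region = generate_np(size=rsize, width=rsize * csize)    # (512, 3) region centers
offsets = generate_np(size=csize, width=csize)           # (32768, 3) chunk points around the origin
# every region center plus every chunk offset: (16777216, 3), ~400 MB as float64
points = (region[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
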
Alex Gaynor
  • looks like with pypy 2.3 and small numbers (no more than 32k points), arr[i] = p seems to work just fine. However, beyond 32k points, we see a divergence between the list comprehension and a generation method (see gist: https://gist.github.com/brianbruggeman/6d9d6439332b7a9d2ef4) – Brian Bruggeman May 24 '14 at 04:00