3

I have a very simple problem: in my Python toolbox, I have to compute the values of polynomials (usually of degree 3 or 2, seldom others, always of integer degree) on a large vector (size >> 10^6). Storing the result in a buffer is not an option because I have several of these vectors, so I would quickly run out of memory, and I usually need to compute each result only once anyway. The performance of numpy.polyval is actually quite good, but it is still my bottleneck. Can I somehow make the evaluation of the polynomial faster?
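Roughly, each evaluation looks like this (the coefficients here are made up just to give an idea of the scale):

import numpy as np

# made-up degree-3 coefficients; in practice they change from call to call
p = [1.2, -3.4, 0.5, 2.0]

# one of several large input vectors (values of order 10^4)
x = np.random.uniform(0.0, 1.0e4, size=10**7)

# this single call is the bottleneck
y = np.polyval(p, x)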

Addendum

I think that Joe Kington's pure-numpy solution is the right one for me, in particular because it avoids the potential installation issues of extra libraries or Cython. For those who asked, the numbers in the vector are large (of order 10^4), so I don't think the suggested approximations would work.

  • There are quite a few questions here. Most of them cannot be answered in general, some of them cannot be answered unless you provide more detail, and the last one you should be able to answer by doing some googling. – cel Sep 11 '15 at 15:09
  • Well, if there is an algorithm faster than Horner's method (which is what `numpy.polyval` uses) for this problem on Google, I just cannot find it. –  Sep 11 '15 at 15:14
  • You could consider dropping the third-degree term if the range of your `x` is less than 1. Or, even better, you could approximate the third-degree term by its derivative. – under_the_sea_salad Sep 11 '15 at 15:28
  • Horner's method is an algorithm that minimizes the number of multiplications, and it pays off mostly for high-order polynomials. Are multiplications your bottleneck? – Vladimir Ignatev Sep 11 '15 at 15:46

2 Answers

7

You can actually speed it up slightly by doing the operations in-place (or by using numexpr or numba, which will automatically do what I'm doing manually below).
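For example, a numexpr version of the cubic case might look something like this (just a sketch, assuming numexpr is installed and the degree is fixed at 3; polyval_numexpr is an illustrative name):

import numexpr as ne

def polyval_numexpr(p, x):
    # Horner form of a cubic, evaluated in a single numexpr pass;
    # numexpr compiles the expression and avoids full-size temporaries
    p3, p2, p1, p0 = p
    return ne.evaluate("((p3 * x + p2) * x + p1) * x + p0")

The rest of this answer sticks to plain numpy.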

numpy.polyval is a very short function. Leaving out a few type checks, etc., it amounts to:

def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]   # builds a brand-new temporary array on every iteration
    return y

The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.

What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:

def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)   # assume float output from the start
    for i, v in enumerate(p):
        y *= x   # in-place: no temporary array
        y += v
    return y
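To see the dtype assumption in action: with all-integer inputs, np.polyval follows the normal upcasting rules, while faster_polyval always returns floats (a quick illustration with made-up inputs):

p_int = [1, 2, 3]
x_int = np.arange(5)

print(np.polyval(p_int, x_int).dtype)      # an integer dtype, per the upcasting rules
print(faster_polyval(p_int, x_int).dtype)  # float64, forced by zeros(..., dtype=float)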

As an example, let's say we have the following input:

# Third-order polynomial coefficients
p = [4.5, 9.8, -9.2, 1.2]

# One-million-element array
x = np.linspace(-10, 10, int(1e6))

The results are identical:

In [3]: np_result = np.polyval(p, x)

In [4]: new_result = faster_polyval(p, x)

In [5]: np.allclose(np_result, new_result)
Out[5]: True

And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):

In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop

In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop

For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
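If you want to check the peak memory yourself, recent numpy versions (on Python 3.6+) report their buffer allocations to tracemalloc, so something along these lines should give a rough picture (peak_mb is just an illustrative helper):

import tracemalloc

def peak_mb(func, *args):
    # Return the peak traced memory (in MB) used while calling func
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

print(peak_mb(np.polyval, p, x))
print(peak_mb(faster_polyval, p, x))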

Joe Kington
  • Dang, I was just in the process of rolling a Cythonized version of this, but it looks like an optimized bare-numpy version performs just as fast. There is really a lot I need to learn about squeezing every last drop of performance out of numpy code. – Simon Gibbons Sep 11 '15 at 15:49
  • @SimonGibbons - To be fair, this is the type of problem where "bare" numpy shines. There are plenty of other cases where the Cythonized version would be a _lot_ faster. Thinking about the intermediate arrays created is usually key to squeezing the last drop of performance out of numpy, though, so it's usually a good place to start. – Joe Kington Sep 11 '15 at 15:51
  • There is just one thing I don't get, possibly silly: why do you use enumerate instead of range/xrange? –  Sep 11 '15 at 16:25
  • @AJC - There's no effective difference; it just combines the indexing and iterating together. It's purely for cleaner code. I'd be willing to bet that the original version of `polyval` was written before `enumerate` was widely available (it was added in Python 2.3). I used it out of habit, but I probably should have left in the `range` and indexing to avoid any differences. Either way, it won't have any impact on performance. – Joe Kington Sep 11 '15 at 16:29
0

I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster, and it is interesting to see that the simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimisation inside numpy.)

So here is my comparison with np.polynomial.polynomial.polyval and a slightly faster version.

def fastest_polyval(x, a):
    # `a` holds the coefficients in increasing-power order (np.polynomial convention)
    y = a[-1]               # start from the highest-order coefficient
    for ai in a[-2::-1]:
        y *= x              # the first pass turns the scalar into the output array
        y += ai
    return y

It avoids the initial zero array and needs one fewer pass through the loop.

y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1]   # reverse coefficients
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)

np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)

# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)