
I want to do polynomial calculus in Python. The polynomial package in numpy is not fast enough for me, so I decided to rewrite several functions in Fortran and use f2py to create shared libraries that are easily imported into Python. Currently I am benchmarking my routines for univariate and bivariate polynomial evaluation against their numpy counterparts.

In the univariate routine I use Horner's method (the recurrence pval = pval*x + p(i), iterating from the highest coefficient down), as does numpy.polynomial.polynomial.polyval. I have observed that the factor by which the Fortran routine is faster than its numpy counterpart increases as the order of the polynomial increases.

In the bivariate routine I apply Horner's method twice: first in y and then in x. Unfortunately I have observed that for increasing polynomial order, the numpy counterpart catches up and eventually surpasses my Fortran routine. As numpy.polynomial.polynomial.polyval2d uses an approach similar to mine, I find this second observation strange.
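For reference, the bivariate scheme I mean is equivalent to this pure-Python sketch (illustration only, with p[i, j] holding the coefficient of x**i * y**j):

import numpy as np

def horner2d(p, x, y):
    # Horner in y, where each y-coefficient is itself obtained by Horner in x
    pval = 0.0
    for j in xrange(p.shape[1] - 1, -1, -1):
        tmp = 0.0
        for i in xrange(p.shape[0] - 1, -1, -1):
            tmp = tmp*x + p[i, j]
        pval = pval*y + tmp
    return pval

p = np.random.rand(3, 3)
print horner2d(p, 0.5, 0.5)                            # should match:
print np.polynomial.polynomial.polyval2d(0.5, 0.5, p)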

I am hoping that this result stems from my inexperience with Fortran and f2py. Does anyone have a clue why the univariate routine always appears superior, while the bivariate routine is only superior for low-order polynomials?

EDIT: Here is my latest updated code, benchmark script and performance plots:

polynomial.f95

subroutine polyval(p, x, pval, nx)

    implicit none

    real(8), dimension(nx), intent(in) :: p
    real(8), intent(in) :: x
    real(8), intent(out) :: pval
    integer, intent(in) :: nx
    integer :: i

    pval = 0.0d0
    do i = nx, 1, -1
        pval = pval*x + p(i)
    end do

end subroutine polyval

subroutine polyval2(p, x, y, pval, nx, ny)

    implicit none

    real(8), dimension(nx, ny), intent(in) :: p
    real(8), intent(in) :: x, y
    real(8), intent(out) :: pval
    integer, intent(in) :: nx, ny
    real(8) :: tmp
    integer :: i, j

    pval = 0.0d0
    do j = ny, 1, -1
        tmp = 0.0d0
        do i = nx, 1, -1
            tmp = tmp*x + p(i, j)
        end do
        pval = pval*y + tmp
    end do

end subroutine polyval2

subroutine polyval3(p, x, y, z, pval, nx, ny, nz)

    implicit none

    real(8), dimension(nx, ny, nz), intent(in) :: p
    real(8), intent(in) :: x, y, z
    real(8), intent(out) :: pval
    integer, intent(in) :: nx, ny, nz
    real(8) :: tmp, tmp2
    integer :: i, j, k

    pval = 0.0d0
    do k = nz, 1, -1
        tmp2 = 0.0d0
        do j = ny, 1, -1
            tmp = 0.0d0
            do i = nx, 1, -1
                tmp = tmp*x + p(i, j, k)
            end do
            tmp2 = tmp2*y + tmp
        end do
        pval = pval*z + tmp2
    end do

end subroutine polyval3
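A note on the f2py interface, in case the Python call sites below look odd: f2py turns intent(out) arguments into return values and infers the hidden size arguments (nx, ny, nz) from the array shapes, so the routines are called from Python without pval or the sizes. A quick sanity check, assuming the module has been built as in benchmark.py below:

import numpy as np
import polynomial  # the f2py-compiled module

p = np.random.rand(5)
x = 0.3
print polynomial.polyval(p, x)                 # pval is the return value
print np.polynomial.polynomial.polyval(x, p)   # should agree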

benchmark.py (use this script to produce plots)

import time
import os

import numpy as np
import matplotlib.pyplot as plt

# Compile and import Fortran module
os.system('f2py -c polynomial.f95 --opt="-O3 -ffast-math" \
--f90exec="gfortran-4.8" -m polynomial')
import polynomial

# Create random x, y and z values
x = np.random.rand()
y = np.random.rand()
z = np.random.rand()

# Number of repetitions
repetition = 10

# Number of times to loop over a function
run = 100

# Number of data points
points = 26

# Min and max number of coefficients for the univariate case
n_uni_min = 4
n_uni_max = 100

# Min and max number of coefficients for the bivariate case
n_bi_min = 4
n_bi_max = 100

# Min and max number of coefficients for the trivariate case
n_tri_min = 4
n_tri_max = 100

# Case on/off switch
case_on = [1, 1, 1]


#=================#
# UNIVARIATE CASE #
#=================#

if case_on[0]:

    # Number of coefficients (polynomial order + 1) for several univariate polynomials
    n_uni = np.array([int(v) for v in np.linspace(n_uni_min, n_uni_max, points)])

    # Initialise arrays for storing timing results
    time_uni_numpy = np.zeros(n_uni.size)
    time_uni_fortran = np.zeros(n_uni.size)

    for i in xrange(len(n_uni)):
        # Create random univariate polynomial of order n - 1
        p = np.random.rand(n_uni[i])

        # Time evaluation of polynomial using NumPy
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): np.polynomial.polynomial.polyval(x, p)
            t2 = time.time()
            dt.append(t2 - t1)
        time_uni_numpy[i] = np.average(dt[2::])

        # Time evaluation of polynomial using Fortran
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): polynomial.polyval(p, x)
            t2 = time.time()
            dt.append(t2 - t1)
        time_uni_fortran[i] = np.average(dt[2::])

    # Speed-up factor
    factor_uni = time_uni_numpy / time_uni_fortran

    results_uni = np.zeros([len(n_uni), 4])
    results_uni[:, 0] = n_uni
    results_uni[:, 1] = factor_uni
    results_uni[:, 2] = time_uni_numpy
    results_uni[:, 3] = time_uni_fortran
    print results_uni, '\n'

    plt.figure()
    plt.plot(n_uni, factor_uni)
    plt.title('Univariate comparison')
    plt.xlabel('# coefficients')
    plt.ylabel('Speed-up factor')
    plt.xlim(n_uni[0], n_uni[-1])
    plt.ylim(0, max(factor_uni))
    plt.grid(aa=True)


#================#
# BIVARIATE CASE #
#================#

if case_on[1]:

    # Number of coefficients (polynomial order + 1) for several bivariate polynomials
    n_bi = np.array([int(v) for v in np.linspace(n_bi_min, n_bi_max, points)])

    # Initialise arrays for storing timing results
    time_bi_numpy = np.zeros(n_bi.size)
    time_bi_fortran = np.zeros(n_bi.size)

    for i in xrange(len(n_bi)):
        # Create random bivariate polynomial of order n - 1 in x and in y
        p = np.random.rand(n_bi[i], n_bi[i])

        # Time evaluation of polynomial using NumPy
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): np.polynomial.polynomial.polyval2d(x, y, p)
            t2 = time.time()
            dt.append(t2 - t1)
        time_bi_numpy[i] = np.average(dt[2::])

        # Time evaluation of polynomial using Fortran
        p = np.asfortranarray(p)
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): polynomial.polyval2(p, x, y)
            t2 = time.time()
            dt.append(t2 - t1)
        time_bi_fortran[i] = np.average(dt[2::])

    # Speed-up factor
    factor_bi = time_bi_numpy / time_bi_fortran

    results_bi = np.zeros([len(n_bi), 4])
    results_bi[:, 0] = n_bi
    results_bi[:, 1] = factor_bi
    results_bi[:, 2] = time_bi_numpy
    results_bi[:, 3] = time_bi_fortran
    print results_bi, '\n'

    plt.figure()
    plt.plot(n_bi, factor_bi)
    plt.title('Bivariate comparison')
    plt.xlabel('# coefficients')
    plt.ylabel('Speed-up factor')
    plt.xlim(n_bi[0], n_bi[-1])
    plt.ylim(0, max(factor_bi))
    plt.grid(aa=True)


#=================#
# TRIVARIATE CASE #
#=================#

if case_on[2]:

    # Number of coefficients (polynomial order + 1) for several trivariate polynomials
    n_tri = np.array([int(v) for v in np.linspace(n_tri_min, n_tri_max, points)])

    # Initialise arrays for storing timing results
    time_tri_numpy = np.zeros(n_tri.size)
    time_tri_fortran = np.zeros(n_tri.size)

    for i in xrange(len(n_tri)):
        # Create random trivariate polynomial of order n - 1 in x, y and z
        p = np.random.rand(n_tri[i], n_tri[i], n_tri[i])

        # Time evaluation of polynomial using NumPy
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): np.polynomial.polynomial.polyval3d(x, y, z, p)
            t2 = time.time()
            dt.append(t2 - t1)
        time_tri_numpy[i] = np.average(dt[2::])

        # Time evaluation of polynomial using Fortran
        p = np.asfortranarray(p)
        dt = []
        for j in xrange(repetition):
            t1 = time.time()
            for r in xrange(run): polynomial.polyval3(p, x, y, z)
            t2 = time.time()
            dt.append(t2 - t1)
        time_tri_fortran[i] = np.average(dt[2::])

    # Speed-up factor
    factor_tri = time_tri_numpy / time_tri_fortran

    results_tri = np.zeros([len(n_tri), 4])
    results_tri[:, 0] = n_tri
    results_tri[:, 1] = factor_tri
    results_tri[:, 2] = time_tri_numpy
    results_tri[:, 3] = time_tri_fortran
    print results_tri, '\n'

    plt.figure()
    plt.plot(n_tri, factor_tri)
    plt.title('Trivariate comparison')
    plt.xlabel('# coefficients')
    plt.ylabel('Speed-up factor')
    plt.xlim(n_tri[0], n_tri[-1])
    plt.ylim(0, max(factor_tri))
    plt.grid(aa=True)


#==============================================================================

plt.show()

Results: (univariate, bivariate and trivariate speed-up plots)

EDIT: My correction to steabert's proposal:

subroutine polyval(p, x, pval, nx)

    implicit none

    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx

    integer, parameter :: simd = 8
    real*8 :: tmp(simd), xpower(simd), maxpower
    integer :: i, j, k

    xpower(1) = x
    do i = 2, simd
        xpower(i) = xpower(i-1)*x
    end do
    maxpower = xpower(simd)

    tmp = 0.0d0
    do i = nx+1, simd+2, -simd
        do j = 1, simd
            tmp(j) = tmp(j)*maxpower + p(i-j)*xpower(simd-j+1)
        end do
    end do

    k = mod(nx-1, simd)
    if (k == 0) then
        pval = sum(tmp) + p(1)
    else
        pval = sum(tmp) + p(k+1)
        do i = k, 1, -1
            pval = pval*x + p(i)
        end do
    end if

end subroutine polyval

EDIT: Test code verifying that the code directly above gives poor results for x > 1:

import polynomial as P
import numpy.polynomial.polynomial as PP

import numpy as np

for n in xrange(2,100):
    poly1n = np.random.rand(n)
    poly1f = np.asfortranarray(poly1n)

    x = 2

    print np.linalg.norm(P.polyval(poly1f, x) - PP.polyval(x, poly1n)), '\n'
Aeronaelius
  • I have edited the code in `benchmark.py`. There was a small but significant error. I have also uploaded new result plots. The univariate result is now looking even better – Aeronaelius Sep 09 '13 at 23:37
  • I wanted to note that `pval` and `tmp` seem to be used uninitialized in your Fortran subroutines. – steabert Sep 10 '13 at 09:17

4 Answers


In the bivariate case, p is a two-dimensional array, which means that C versus Fortran ordering of arrays comes into play. By default numpy produces C-ordered arrays, while Fortran routines expect Fortran ordering.

f2py is smart enough to deal with this, and automatically converts between C- and Fortran-ordered arrays. However, the conversion copies the data, which adds overhead and is one possible reason for the reduced performance. You can check whether this is the cause by converting p to Fortran order with numpy.asfortranarray outside your timing loop. Of course, for this to be meaningful, in your real use case you want to make sure that your input arrays are in Fortran order.
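(For illustration, the two orderings can be inspected directly from numpy:)

import numpy as np

p = np.random.rand(3, 3)     # numpy default: C (row-major) order
print np.isfortran(p)        # False: f2py has to copy this on every call
pf = np.asfortranarray(p)    # same values, Fortran (column-major) order
print np.isfortran(pf)       # True: passed through without a copy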

f2py has an option -DF2PY_REPORT_ON_ARRAY_COPY which can warn you any time an array is copied.
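(For instance, reusing the build step from benchmark.py; if I read the f2py docs correctly, the value is a size threshold in array elements above which copies are reported, so 1 reports essentially every copy:)

import os
# Rebuild with copy reporting: f2py then prints a message to stderr
# whenever an argument array has to be copied to Fortran order.
os.system('f2py -c -DF2PY_REPORT_ON_ARRAY_COPY=1 -m polynomial polynomial.f95')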

If this is not the cause, then you need to consider more in-depth details, such as which Fortran compiler you are using and what sort of optimisations it is applying. Examples of things which could slow you down include allocating arrays on the heap instead of the stack (with expensive calls to malloc), although I would expect such effects to become less significant for larger arrays.

Finally, you should consider the possibility that for bivariate evaluation at large N, the numpy routines are already essentially at optimal efficiency. In that case the numpy routine spends most of its time running optimised C code, and the overhead of the Python layer becomes negligible in comparison. You would then not expect your Fortran code to show any significant speed-up.

DaveP
  • Thank you DaveP, I got a 5x speed-up by using `numpy.asfortranarray` in the bivariate case when p is of size 1000x1000 – Aeronaelius Sep 10 '13 at 20:19

I would guess that your tmp array is getting too large, such that it requires L2, L3 or even main-memory accesses instead of staying in cache. It might be better to break these loops up and process only chunks of them at a time (strip-mining).

haraldkl
  • Strip-mining is a good idea. However, `tmp` does not get too big. While p has size `NxN`, `tmp` only has size `N`. The decline in performance starts around `N = 65`. I cannot see how a one-dimensional array of length 65 can be considered big. – Aeronaelius Sep 09 '13 at 23:17
  • I have adjusted my algorithm: `tmp` is now a scalar instead of a vector. Unfortunately the results are still the same, so the problem was not related to `tmp`. – Aeronaelius Sep 10 '13 at 03:12
  • Ah, sorry, I misinterpreted your plots as having the order of the polynomials on the x-axis and thus a larger range in N. I guess the explanation by DaveP is the best one then. – haraldkl Sep 10 '13 at 20:20
  • haraldkl, you did not misinterpret the plots. The x-axis gives the number of polynomial coefficients, so if the number is 10, the polynomial order is 9: order = number - 1. The same applies for the bivariate case, as I only use polynomials with equal orders in x and y. – Aeronaelius Sep 10 '13 at 20:32
  • Fine enough, so you go up to 1000? I'd suggest a block length of around the 65 you mentioned, in an attempt to weaken the loss in performance above 100... – haraldkl Sep 10 '13 at 21:08
  • In the univariate case I go up to 1024^2 coefficients and in the bivariate case I go to 1024 coefficients in x and y. Thus in both cases p contains the same number of elements. – Aeronaelius Sep 10 '13 at 22:08

Your function is very short, so you would get better results by inlining polyval. Also, you can avoid the index calculations by simply inverting the loops:

subroutine polyval2(p, x, y, pval, nx, ny)

    implicit none

    real(8), dimension(nx, ny), intent(in), target :: p
    real(8), intent(in) :: x, y
    real(8), intent(out) :: pval
    integer, intent(in) :: nx, ny
    real(8) :: tmp
    integer :: i, ii

    pval = 1.d0
    do i = ny, 1
        tmp = 1.d0
        do ii = nx, 1
            tmp = tmp*x + p(ii,i)
        end do
        pval = pval*y + tmp
    end do

end subroutine polyval2

With this code I got ~10% shorter execution time for large arrays compared to the original code you posted. (I tested a pure Fortran program with your code and Nx = Ny = 1000, compiled with gfortran -O3 -funroll-loops.)

I agree with haraldkl: the sharp drop in performance when the dimensions get too large is very typical of cache/memory access patterns. Strip-mining helps, but I would not encourage doing it by hand. Use compiler flags instead: -floop-strip-mine for gfortran and (included in) -O3 for ifort. Also, try loop unrolling: -funroll-loops for both gfortran and ifort.

You can specify those flags with f2py -c --f90flags="...".
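(For example, mirroring the build line already used in benchmark.py; the compiler name is just an illustration:)

import os
os.system('f2py -c --f90flags="-funroll-loops -floop-strip-mine" \
--f90exec="gfortran-4.8" -m polynomial polynomial.f95')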

Alexander Vogt
  • Great tips. I needed to add a step of -1 to make the loops work with inverted indices. However, even after adding the flag `-funroll-loops` I did not get any speed-up (I tested from Python instead of a pure Fortran program). The biggest issue, however, is that the flag `-floop-strip-mine` gives me a compiler error because I am missing some package, and even after installing the package the error is not gone. The error is: `sorry, unimplemented: Graphite loop optimizations can only be used if the libcloog-ppl1 or libcloog-ppl0 package is installed`. I use `gfortran 4.6.4` – Aeronaelius Sep 10 '13 at 20:25
  • Mhm, it seems your compiler does not support this... Which version of gfortran are you using? You can download the latest binaries and the required infrastructure from this [GCC site](http://gcc.gnu.org/wiki/GFortranBinaries). For f2py you can then specify the compiler using `--f90exec`. – Alexander Vogt Sep 10 '13 at 20:37
  • I have gfortran 4.6, 4.7 and 4.8 installed. However f2py is defaulting to 4.6. I have now used `--f90exec=gfortran-4.8` and everything works. Thanks! – Aeronaelius Sep 10 '13 at 21:02

Following the other suggestions, using p = np.asfortranarray(p) before the timer indeed put the performance on par with numpy when I tested it. I extended the range for the bivariate benchmark to n_bi = np.array([2**i for i in xrange(1, 15)]), so that the p matrix would be larger than my L3 cache.

To further optimize this, I don't think automatic compiler options will be of much help, since the inner loop has a dependency. Only if you unroll it manually does ifort vectorize the innermost loop. With gfortran, -O3 and -ffast-math were needed as well. For matrix sizes limited by main-memory bandwidth, this increases the performance benefit over numpy from a factor of 1 to a factor of 3.
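(A sketch of why the manual unrolling helps, not necessarily the exact reasoning behind the code below: Horner's recurrence is one serial dependency chain, but the polynomial can be regrouped into s = simd independent chains,

$$p(x) = \sum_{j=0}^{n-1} c_j\,x^j = \sum_{k=0}^{s-1} x^k \sum_{q \ge 0} c_{qs+k}\,(x^s)^q,$$

where each inner sum is a polynomial in x^s, evaluated by Horner with the multiplier xfactor = x^s; the s chains carry no dependencies on each other, which is what lets the compiler vectorize the innermost loop.)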

Update: after applying this also to the univariate code and compiling with f2py --opt='-O3 -ffast-math' -c -m polynomial polynomial.f90, I get the following source, with the benchmark.py results shown below:

subroutine polyval(p, x, pval, nx)

    implicit none

    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx

    integer, parameter :: simd = 8
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, k

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    tmp = 0.0d0
    do i = 1, nx, simd
        do k = 1, simd
            tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
        end do
    end do
    pval = sum(tmp)

end subroutine polyval

subroutine polyval2(p, x, y, pval, nx, ny)

    implicit none

    real*8, dimension(nx, ny), intent(in) :: p
    real*8, intent(in) :: x, y
    real*8, intent(out) :: pval
    integer, intent(in) :: nx, ny

    integer, parameter :: simd = 8
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, j, k

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    ! horner
    pval = 0.0d0
    do i = 1, ny
        tmp = 0.0d0
        do j = 1, nx, simd
            ! inner vectorizable loop
            do k = 1, simd
                tmp(k) = tmp(k)*xfactor + p(nx-(j+k-1)+1, ny-i+1)*vecx(simd-k+1)
            end do
        end do
        pval = pval*y + sum(tmp)
    end do

end subroutine polyval2

Update 2: As pointed out in the comments, the code above is not correct when the sizes are not divisible by simd. It is just showing the concept of manually helping the compiler, so don't use it as-is. If the sizes are not multiples of simd, a small remainder loop has to take care of the dangling indices. It's not difficult to do; here is the corrected procedure for the univariate case, which should be straightforward to extend to the bivariate one:

subroutine polyval(p, x, pval, nx)

    implicit none

    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx

    integer, parameter :: simd = 4
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, k, nr

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    ! check remainder
    nr = mod(nx, simd)

    ! horner on the part divisible by simd
    tmp = 0.0d0
    do i = 1, nx-nr, simd
        do k = 1, simd
            tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
        end do
    end do
    pval = sum(tmp)

    ! do remainder
    pval = pval * x**nr
    do i = 1, nr
        pval = pval + p(i) * vecx(i)
    end do

end subroutine polyval

(univariate speed-up plot)

(bivariate speed-up plot)

One should be careful with very small sizes, as the times become too short to give an accurate performance profile. Also, relative times with respect to numpy can be deceiving, since the absolute time with numpy could be very bad. So below are timings for the largest case:

For univariate with nx = 2^20, the time is 1.21 s for numpy and 1.69e-3 s for the custom Fortran version. For bivariate with nx*ny = 2^20, the time is 8e-3 s for numpy and 1.68e-3 s for the custom version. The fact that the time for univariate and bivariate is the same when the total size nx*ny is the same is very important, as it supports the claim that the code is performing near the memory-bandwidth limit.
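(A back-of-the-envelope check of that claim, assuming a single pass over the coefficient array:)

n_bytes = 2**20 * 8      # 2**20 real*8 coefficients, read once per evaluation
t = 1.68e-3              # seconds, from the timing above
print n_bytes / t / 1e9  # ~5 GB/s, the order of main-memory bandwidth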

Update 3: with the new Python script for smaller sizes and simd = 4, I get the following performance:

(univariate and bivariate speed-up plots)

Update 4: As for correctness, the results are the same to within double-precision accuracy, which you can see by running this Python code for the univariate example:

import polynomial as P
import numpy.polynomial.polynomial as PP

import numpy as np

for n in xrange(2,100):
    poly1n = np.random.rand(n)
    poly1f = np.asfortranarray(poly1n)

    x = 2

    print "%18.14e" % P.polyval(poly1f, x)
    print "%18.14e" % PP.polyval(x, poly1n)
    print (P.polyval(poly1f, x) - PP.polyval(x, poly1n))/PP.polyval(x,poly1n), '\n'
steabert
  • I ran your code, but for me it does not improve performance. I have applied a lot of the comments posted here. I will soon update my code and benchmark script. PS: I am limiting the size of p to at most 100 (univariate), 100x100 (bivariate) and 100x100x100 (trivariate). This will be the range of my actual use. – Aeronaelius Sep 11 '13 at 10:27
  • it only improves performance with `gfortran` if you use `f2py --opt='-O3 -ffast-math' -c -m polynomial polynomial.f95` (at least with version 4.8 which I use...). Which processor do you use and what is your memory bandwidth? – steabert Sep 11 '13 at 11:00
  • Intel i7-2670QM, 4 cores, 8 threads, 32 KB dedicated L1 cache, 256 KB dedicated L2 cache, 6 MB shared L3 cache. 8 GB RAM. See this page for more information: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7%20Mobile%20i7-2670QM.html. I have updated my benchmark script to execute each function several times in order to deal with inaccurate performance profile for a polynomial with a small number of coefficients – Aeronaelius Sep 11 '13 at 13:01
  • If you operate in the 10,000-element region for the bivariate case, that's a stream of 80 kB for p, which fits completely in L2, so performance is of course different, as it's not determined by main memory. Thanks for the updated python script, I'll give it a try. – steabert Sep 11 '13 at 13:07
  • I ran your code again and, damn, performance went up. I am getting graphs similar to yours; I will post them later on. I do have a burning question: I am having a hard time understanding your code. What is it that you are trying to achieve with `simd` and `xfactor`? – Aeronaelius Sep 11 '13 at 17:47
  • I have been examining your code and now understand what you are trying to do. Unfortunately your polyval function for the univariate case is wrong: it produces the wrong answer. Compare the output of your code with that of numpy. I have fixed the code for the univariate case and will post it in my question up top. I will now continue to examine your bivariate code. – Aeronaelius Sep 11 '13 at 22:30
  • I have further examined my correction to your code. Evaluating the code for small x (x << 1) works great; however, for x >= 1 I get an error norm of 1e40. My original code does not deviate one bit. – Aeronaelius Sep 11 '13 at 22:51
  • Hi, the code as posted isn't correct because it needs powers of two as the size. So, indeed, in the general case you have to do a remainder loop at the end to take care of that. If I find the time, I'll update the answer, but it's basically checking whether nx is divisible by simd, and then putting a small loop over `j` after the big loop to take care of that. From the beginning I always printed the results, and when the sizes are powers of two, they match the numpy result. – steabert Sep 12 '13 at 05:45
  • The code is working better now for x << 1; however, it is still diverging drastically from the numpy result for x > 1. Try the test code I have posted up top. – Aeronaelius Sep 12 '13 at 07:37
  • (1) The order of evaluation of the operations is different, so it's normal that the result isn't bit-for-bit the same, but that could apply to any kind of optimization. (2) The results are actually correct within double-precision accuracy, which is what counts. I posted an update to explain. – steabert Sep 12 '13 at 07:56
  • Thank you very much steabert. Your help is greatly appreciated. I will adopt your code and extend it for the bivariate and trivariate cases I will be using. Thank you very much! – Aeronaelius Sep 12 '13 at 08:31