
I recently switched from Matlab to Python. While converting one of my lengthy programs, I was surprised to find Python very slow. I profiled the code and traced the problem to one function hogging time. This function is called from various places in my code (it is part of other functions which are called recursively). The profiler shows that 300 calls are made to this function in both Matlab and Python.
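
For reference, a hotspot like this can be located with Python's built-in cProfile; a minimal sketch (the file name and report settings are illustrative, not from the original session):

import cProfile
import pstats

# Run the driver script under the profiler (equivalent to
# `python -m cProfile -o profile.out test.py` from the shell) ...
cProfile.run('import test', 'profile.out')

# ... then list the ten entries with the highest cumulative time.
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)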

In short, the following code summarizes the issue at hand:

MATLAB

The class containing the function:

classdef ExampleKernel1 < handle
    methods (Static)
        function [kernel] = kernel_2D(M, x, N, y)
            kernel = zeros(M, N);
            for i = 1 : M
                for j = 1 : N
                    % Define the custom kernel function here
                    kernel(i, j) = sqrt((x(i, 1) - y(j, 1)) .^ 2 + ...
                                        (x(i, 2) - y(j, 2)) .^ 2);
                end
            end
        end
    end
end

and the script test.m to call it:

xVec=[   
49.7030   78.9590
42.6730   11.1390
23.2790   89.6720
75.6050   25.5890
81.5820   53.2920
44.9680    2.7770
38.7890   78.9050
39.1570   33.6790
33.2640   54.7200
4.8060   44.3660
49.7030   78.9590
42.6730   11.1390
23.2790   89.6720
75.6050   25.5890
81.5820   53.2920
44.9680    2.7770
38.7890   78.9050
39.1570   33.6790
33.2640   54.7200
4.8060   44.3660
];
N=size(xVec,1);
kex1=ExampleKernel1;
tic
for i=1:300
    K=kex1.kernel_2D(N,xVec,N,xVec);
end
toc

Gives the output

>> clear all
>> test
Elapsed time is 0.022426 seconds.
>> test
Elapsed time is 0.009852 seconds.

PYTHON 3.4

The class containing the function, CustomKernels.py:

from numpy import zeros
from math import sqrt

class CustomKernels:
    """Class for defining the custom kernel functions"""
    @staticmethod
    def exampleKernelA(M, x, N, y):
        """Example kernel function A"""
        kernel = zeros([M, N])
        for i in range(0, M):
            for j in range(0, N):
                # Define the custom kernel function here
                kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
        return kernel

and the script test.py to call it:

import numpy as np
from CustomKernels import CustomKernels
from time import perf_counter

xVec = np.array([
    [49.7030,  78.9590],
    [42.6730,  11.1390],
    [23.2790,  89.6720],
    [75.6050,  25.5890],
    [81.5820,  53.2920],
    [44.9680,   2.7770],
    [38.7890,  78.9050],
    [39.1570,  33.6790],
    [33.2640,  54.7200],
    [4.8060 ,  44.3660],
    [49.7030,  78.9590],
    [42.6730,  11.1390],
    [23.2790,  89.6720],
    [75.6050,  25.5890],
    [81.5820,  53.2920],
    [44.9680,   2.7770],
    [38.7890,  78.9050],
    [39.1570,  33.6790],
    [33.2640,  54.7200],
    [4.8060 ,  44.3660]
    ])
N = xVec.shape[0]
kex1 = CustomKernels.exampleKernelA
start = perf_counter()
for i in range(300):
    K = kex1(N, xVec, N, xVec)
print('%f secs' % (perf_counter() - start))

Gives the output

%run test.py
 0.940515 secs
%run test.py
 0.884418 secs
%run test.py
 0.940239 secs

RESULTS

Comparing the results, Matlab is about 42 times faster after a "clear all" is called, and about 100 times faster if the script is run multiple times without calling "clear all". That is at least one order of magnitude, if not two. This is a very surprising result for me. I was expecting it to be the other way around.

Can someone please shed some light on this?

Can someone suggest a faster way to perform this?

SIDE NOTE

I have also tried using numpy.sqrt, which makes the performance worse; therefore I am using math.sqrt in Python.
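
To see this per-call overhead directly, a quick `timeit` comparison on scalars (a sketch, not the original measurement):

import timeit

# math.sqrt is a thin C wrapper, while numpy.sqrt goes through generic
# ufunc dispatch even for a single Python float, hence the slowdown here
# (numpy.sqrt wins only on whole arrays).
t_math = timeit.timeit('sqrt(2.0)', setup='from math import sqrt', number=1_000_000)
t_numpy = timeit.timeit('sqrt(2.0)', setup='from numpy import sqrt', number=1_000_000)
print(t_math, t_numpy)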

EDIT

The for loops for calling the functions are purely fictitious. They are there just to "simulate" 300 calls to the function. As I described earlier, the kernel functions (kernel_2D in Matlab and kex1 in Python) are called from various different places in the program (as part of a series of recursively called functions which depend on if conditions). To keep the question short, I "simulate" the 300 calls with the for loop. The for loops inside the kernel functions are essential and unavoidable because of the structure of the kernel matrix.

EDIT 2

Here is the larger problem: https://github.com/drfahdsiddiqui/bbfmm2d-python

Fahd Siddiqui
  • Generally don't try to loop over an array in Python. Call the operations on the entire array(s) using numpy so the actual per-element calculation is done inside the library – Martin Beckett Sep 28 '17 at 17:34
  • The power of `numpy` is the ability to get rid of those `for` loops – Daniel F Sep 28 '17 at 17:35
  • I see what you are saying, and this is true for Matlab as well. But the structure of the kernel matrix makes for loops unavoidable in this case. At any rate, why is function calling so expensive in Python and less so in Matlab? – Fahd Siddiqui Sep 28 '17 at 17:36
  • If you have an example of the kernel matrix you want to create, some wizard here can probably find a way to vectorize it. And if not, you can always `@jit` it. – Daniel F Sep 28 '17 at 17:41
  • To me, it looks like you can just use SciPy cdist - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html – Divakar Sep 28 '17 at 17:42
  • I think that's a placeholder function, @Divakar. If not you're absolutely right. – Daniel F Sep 28 '17 at 17:44
  • @DanielF What's a placeholder function? – Divakar Sep 28 '17 at 17:45
  • `exampleKernelA` – Daniel F Sep 28 '17 at 17:45
  • If the problem is the loop by which you call the `exampleKernelA` function 300 times, you should probably consider `numba`'s `@jit`. In general, looping in Python is slow compared to just-in-time (or ahead-of-time, of course) compiled languages like modern *MATLAB* distributions. – norok2 Sep 28 '17 at 17:50
  • @Divakar very good suggestion, I didn't know about that. Alas I have 12 much more complicated kernel functions which are not a mere distance. Moreover this is just a simple example and not a real kernel function that I would use. – Fahd Siddiqui Sep 28 '17 at 17:57
  • @norok2 The 300 calls are fictitious. The function is called from various different places 300 times (part of a series of recursive functions which depend on if conditions). I understand that moving the `for` loop in the function body will make it faster, but it doesn't solve my larger problem. – Fahd Siddiqui Sep 28 '17 at 18:01
  • @FahdSiddiqui I would go on a case by case basis for each of the kernels. Doesn't sound tidy, but I guess getting performance demands it in most scenarios. – Divakar Sep 28 '17 at 18:11
  • In older MATLABs avoiding loops was also important. But newer versions do some `just in time` compiling that speeds up many loops. – hpaulj Sep 28 '17 at 18:42
  • Given that you already have access to C++ code (as per your *EDIT 2*), I would consider generating bindings of that code to Python rather than translating it, unless you are doing this translation for specific reasons other than having the algorithm available in Python. – norok2 Sep 28 '17 at 19:14
  • It is a bit ironic that Python has become popular for scientific computing, because, as you've found, pure Python is (by some subjective standards) *slow*. Hence, tools and libraries such as numpy, cython (derived from pyrex), weave (now defunct), numba, and pypy; plus an assortment of tools that attempt to make it easier to call compiled languages from Python: ctypes (std lib), cffi, swig, f2py, boost.python, xdress, pybind11, ... – Warren Weckesser Sep 28 '17 at 19:16
  • @WarrenWeckesser My initial impressions agree with you. @norok2 Yes I can bind to `C++` but that has no learning for me. Purely academic exercise (hence the specific reason i.e. learning `Python`). In fact I can code everything in C++ but I want a quicker language (`Matlab` is excellent for that) to prototype my ideas before putting them in `C++`. I want to move away from proprietary (`Matlab`) to free (`Python` etc.). So trying a few languages before I settle. Thanks everyone! – Fahd Siddiqui Sep 29 '17 at 21:42

5 Answers


You want to get rid of those for loops. Try this:

import numpy as np

def exampleKernelA(M, x, N, y):
    """Example kernel function A"""
    # Index grids spanning the full M x N kernel
    i, j = np.indices((M, N))
    kernel = np.zeros((M, N))
    # Define the custom kernel function here
    kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
    return kernel

You can also do it with broadcasting, which may be even faster, but a little less intuitive coming from MATLAB.
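
For illustration, a broadcasting version might look like the following sketch (`exampleKernelBroadcast` is an illustrative name, not part of the original answer):

import numpy as np

def exampleKernelBroadcast(M, x, N, y):
    """Example kernel via broadcasting: the inserted size-1 axes let NumPy
    expand the (M, 1, 2) and (1, N, 2) operands to (M, N, 2) element-wise."""
    # M and N are kept only for signature compatibility with the question's code
    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))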

Daniel F
  • Thanks. The suggested form does not work. `line 41, in exampleKernelA kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2) TypeError: only length-1 arrays can be converted to Python scalars` – Fahd Siddiqui Sep 28 '17 at 18:04
  • @FahdSiddiqui You need to use `np.sqrt()`, not `math.sqrt()`. And more importantly, if you want to have any chance of porting your more complicated kernels, you need to understand how NumPy works. Spend half a day on reading the [manual](https://docs.scipy.org/doc/numpy/user/index.html) – it will pay off. – Sven Marnach Sep 28 '17 at 18:49
  • Why is broadcasting _a little less intuitive coming from Matlab_? Matlab has had broadcasting (with a different name) [since 2007](https://es.mathworks.com/help/matlab/ref/bsxfun.html), and it takes place implicitly [since 2017](https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/) – Luis Mendo Sep 28 '17 at 20:14
  • Sorry, my last `MATLAB` experience is... a while ago. Man I feel old now. – Daniel F Sep 29 '17 at 05:30
  • @LuisMendo You don't really suggest that `bsxfun` is intuitive now, do you? And technically that's not broadcasting but an element-wise op – percusse Sep 29 '17 at 15:59
  • @percusse It _is_ broadcasting. The `b` and `s` in `bsxfun` refer to _singleton expansion_, which is Matlab's name for broadcasting. Whether `bsxfun` is intuitive or not is subjective – Luis Mendo Sep 29 '17 at 17:31
  • @LuisMendo You are forgetting the (b)inary, which is key here. You wouldn't have shape expansions with `bsxfun`, since shape-mismatched ops lead to an error. Broadcasting is not limited to the expansion of a binary op but is more general, which is what matlab tried to mimic. Proper broadcasting was introduced in Octave in 2012, in numpy quite some years ago, etc. It's just keeping up. Op *expansion* to the whole array is not the story here. – percusse Sep 29 '17 at 17:44
  • @percusse I don't follow. Can you give an example in Octave or Numpy where the broadcasting is not for a binary (i.e. two-input) operator? – Luis Mendo Sep 29 '17 at 17:46
  • @percusse For a reasonable discussion of this matter you'd have to first define broadcasting, because I have to agree with Luis that I don't understand your distinction. Also, I don't believe broadcasting is intuitive if you don't understand how `bsxfun` behaves. – Andras Deak -- Слава Україні Sep 29 '17 at 21:00
  • @DanielF Much better performance with your suggestion. Should have thought of that! The improvement is significant, from ~0.94 seconds to 0.068 seconds. However `Matlab` is **still** 3 to 6 times faster than `numpy`. I will accept your answer. Thanks – Fahd Siddiqui Sep 29 '17 at 21:26
  • @AndrasDeak As you wish. Just look at how many people here ask `bsxfun`-related questions about matlab which wouldn't be asked had they understood `bsxfun` already. That should give an idea about its intuitiveness. I really don't need to argue about matlab and its API. – percusse Sep 29 '17 at 21:27
  • @SvenMarnach Thank you for your suggestion. Made it work! God forbid if I should read a manual :P Just starting to learn Python for scientific computations, long way to go, much to learn. Want to know what all this hoopla is about Python. This community is great! Thank you everyone, learned a lot in 24 hours! – Fahd Siddiqui Sep 29 '17 at 21:34
  • @percusse No one here _needs_ to argue. It's just for the purpose of improving our knowledge. Pity that you don't want to share yours with us – Luis Mendo Sep 29 '17 at 21:59
  • @LuisMendo *Argue* as in discuss. No hard feelings. It's not just me; take it from Loren https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/#c28438ca-1226-462a-b057-0bf9e30be680 or Cleve Moler's blog or elsewhere. Calling it intuitive is quite a stretch from actual reality. There is absolutely nothing intuitive about matlab's design, and people have been complaining about it for about two decades, as long as I've been (ab)using it. – percusse Sep 29 '17 at 22:41
  • @percusse As I said, the intuitiveness of `bsxfun` is subjective. I was more interested in your distinction between Matlab's broadcasting and other (?) types of non-binary (?) broadcasting – Luis Mendo Sep 29 '17 at 22:43
  • @LuisMendo The generic programming constructs. `bsxfun` takes a binary op and supplies arguments that are coming from both arrays. So in a sense it is a shorthand for a double loop. Broadcasting concept in general is an object property. Take [NumPy's](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) or [eigen's](https://eigen.tuxfamily.org/dox/group__TutorialReductionsVisitorsBroadcasting.html) examples. All have more flexible operations that you can design your own @func. `Reduce` is only one of the properties of broadcasting. – percusse Sep 29 '17 at 22:56
  • @LuisMendo See for example my own question where I am actually trying to stop broadcasting https://stackoverflow.com/questions/40694380/forcing-multiplication-to-use-rmul-instead-of-numpy-array-mul-or-byp – percusse Sep 29 '17 at 23:01
  • @percusse Thanks for the links. I'd say Numpy's broadcast is the same as Matlab's: "Two dimensions are compatible when they are equal, or one of them is 1". About `eigen` I cannot really say, I never used it – Luis Mendo Sep 29 '17 at 23:03
  • Upon further investigation I have found that using `indices` as indicated in the answer is still slower. **The Solution:** use `meshgrid` (see my answer below). **Result:** Very, very fast, 10 times faster than the `indices` approach. – Fahd Siddiqui Oct 03 '17 at 16:40

Upon further investigation I have found that using indices as indicated in the answer is still slower.

Solution: Use meshgrid

from numpy import meshgrid, sqrt

def exampleKernelA(M, x, N, y):
    """Example kernel function A"""
    # Euclidean norm function implemented using the meshgrid idea.
    # Fastest
    x0, y0 = meshgrid(y[:, 0], x[:, 0])
    x1, y1 = meshgrid(y[:, 1], x[:, 1])
    # Define custom kernel here
    kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
    return kernel

Result: Very, very fast, 10 times faster than the indices approach. I am getting times that are closer to C.

However: Using meshgrid in Matlab beats both C and Numpy, being 10 times faster than either.

Still wondering why!
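
As an aside, for this specific Euclidean-distance kernel, the scipy.spatial.distance.cdist route suggested by Divakar in the comments is a one-liner, although it does not generalize to arbitrary kernels. A sketch:

import numpy as np
from scipy.spatial.distance import cdist

# cdist computes all pairwise distances; the default metric is 'euclidean',
# so for this particular kernel it matches the meshgrid version above.
xVec = np.random.rand(20, 2) * 100   # stand-in for the question's data
kernel = cdist(xVec, xVec)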

Fahd Siddiqui

Matlab uses the commercial MKL library. If you use a free Python distribution, check whether the BLAS library used by Python is MKL or another high-performance BLAS library, or the default one, which could be much slower.
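
A quick way to check which BLAS/LAPACK implementation a NumPy build is linked against (a sketch; the exact output format varies across NumPy versions):

import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries;
# look for 'mkl' or 'openblas' in the output.
np.show_config()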

ahala
  • MKL is only relevant if BLAS routines are called, which isn't the case in this example. It's only the JIT compiler which matters here. – max9111 Jun 16 '18 at 15:37

Comparing Jit-Compilers

It has been mentioned that Matlab uses an internal JIT compiler to get good performance on such tasks. Let's compare Matlab's JIT compiler with a Python JIT compiler (Numba).

Code

import numba as nb
import numpy as np
import math
import time

# If the arrays are somewhat larger it also makes sense to parallelize this problem
# cache=True may also make sense
@nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
    """Example kernel function A (Numba, with fastmath)"""
    # Explicitly declaring the size of the second dim also improves performance a bit
    assert x.shape[1] == 2
    assert y.shape[1] == 2

    # Works with all dtypes; zeroing isn't necessary
    kernel = np.empty((M, N), dtype=x.dtype)
    for i in range(M):
        for j in range(N):
            # Define the custom kernel function here
            kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
    return kernel


def exampleKernelB(M, x, N, y):
    """Example kernel function B (NumPy meshgrid)"""
    # Euclidean norm function implemented using the meshgrid idea
    x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
    x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
    # Define custom kernel here
    kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
    return kernel


@nb.njit()
def exampleKernelC(M, x, N, y):
    """Example kernel function C (Numba, without fastmath)"""
    # Explicitly declaring the size of the second dim also improves performance a bit
    assert x.shape[1] == 2
    assert y.shape[1] == 2

    # Works with all dtypes; zeroing isn't necessary
    kernel = np.empty((M, N), dtype=x.dtype)
    for i in range(M):
        for j in range(N):
            # Define the custom kernel function here
            kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
    return kernel


#Your test data
xVec = np.array([
    [49.7030,  78.9590],
    [42.6730,  11.1390],
    [23.2790,  89.6720],
    [75.6050,  25.5890],
    [81.5820,  53.2920],
    [44.9680,   2.7770],
    [38.7890,  78.9050],
    [39.1570,  33.6790],
    [33.2640,  54.7200],
    [4.8060 ,  44.3660],
    [49.7030,  78.9590],
    [42.6730,  11.1390],
    [23.2790,  89.6720],
    [75.6050,  25.5890],
    [81.5820,  53.2920],
    [44.9680,   2.7770],
    [38.7890,  78.9050],
    [39.1570,  33.6790],
    [33.2640,  54.7200],
    [4.8060 ,  44.3660]
    ])

# Compilation happens on the first call;
# it can be avoided with cache=True
res = exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res = exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)

t1 = time.time()
for i in range(10_000):
    res = exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time() - t1)

t1 = time.time()
for i in range(10_000):
    res = exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time() - t1)

t1 = time.time()
for i in range(10_000):
    res = exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time() - t1)

Performance

exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10,000 reps, after a few runs): 0.165s
max9111
  • Switch OP’s loops around, MATLAB code will be significantly faster. Also, fastmath should not feature in this comparison. – Cris Luengo Jun 16 '18 at 19:18
  • @Cris Luengo I already tried to switch the loops, without effect (maybe because of the small array size). I will try it without fastmath and add the results. For a really fair comparison the newest Matlab version should be used too... Add your results. – max9111 Jun 16 '18 at 20:40
  • yes, you’re right, it’s a small array, it probably fits in the cache. Never mind. :) – Cris Luengo Jun 16 '18 at 20:54

I got ~5x speed improvement over the meshgrid solution using only broadcasting:

import numpy as np

def exampleKernelD(M, x, N, y):
    return np.sqrt((x[:, 1:] - y[:, 1:].T) ** 2 + (x[:, :1] - y[:, :1].T) ** 2)
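
A quick sanity check that this slicing trick matches a plain pairwise-distance computation (a sketch reusing exampleKernelD from above, with random stand-in data):

rng = np.random.default_rng(0)
x = rng.random((20, 2)) * 100
# Reference: full (M, N, 2) difference tensor, summed over the coordinate axis.
ref = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
assert np.allclose(exampleKernelD(len(x), x, len(x), x), ref)
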
EGraw