
Consider the following class:

from numpy import var  # variance function from numpy (see the comments below)

class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = var(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm

I use it to calculate the distance between two elements of a vector. I basically create one instance of that class for every dimension of the vector that uses this distance measure (there are dimensions that use other distance measures). Profiling reveals that the __call__ method of this class accounts for 90% of the running time of my knn implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe it would help if I implemented it in C?
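
To give a bit more context, I wire these up roughly like the sketch below (the names and the sample data here are made up for illustration; my real code differs):

# Illustrative only: one distance callable per dimension, results summed.
training_columns = [[1.0, 2.0, 4.0], [0.5, 0.5, 1.5]]  # made-up sample values per dimension
distance_functions = [SquareErrorDistance(column) for column in training_columns]

def vector_distance(a, b):
    # a and b are vectors; in my real code some dimensions use other distance measures
    return sum(dist(x, y) for dist, x, y in zip(distance_functions, a, b))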

If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes to call a C function that does the computation, but apparently the conversion of the parameters and return values is far too expensive, because the resulting code is much slower.
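
For reference, my ctypes attempt looked roughly like this (the library and function names below are placeholders, not my actual code):

import ctypes

# Hypothetical shared library exporting:
#   double square_error(double u, double v, double norm) { return (u - v) * (u - v) * norm; }
lib = ctypes.CDLL("./libsquareerror.so")
lib.square_error.argtypes = [ctypes.c_double, ctypes.c_double, ctypes.c_double]
lib.square_error.restype = ctypes.c_double

# Every call converts three Python floats to c_double and converts the result back,
# which dominates when the computation itself is this cheap.
result = lib.square_error(1.0, 2.5, 0.125)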

I could of course implement the entire knn in C and just call that, but the problem is that, as described above, I use different distance functions for some dimensions of the vectors, and translating all of these to C would be too much work.

So what are my alternatives? Will writing the C function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?

  • I would suggest Cython (an answer with an example implementation might follow in a few minutes). I assume your algorithms are already as tuned as reasonably possible? –  Nov 21 '10 at 18:09
  • @delnan: I already use caching where possible and appropriate, so I do not see any ways of saving distance computations. – Björn Pollex Nov 21 '10 at 18:17
  • Well then... unrelated, what's `dataSample` and `var`? –  Nov 21 '10 at 18:21
  • @delnan: `dataSample` is a list of floats, and `var` is the variance function from numpy. – Björn Pollex Nov 21 '10 at 18:36
  • A little off-topic: You do realize that the expression `__call__()` returns is being calculated as though written like this `(u - v) ** (2 * self._norm)`? See the operator precedence table [here](http://docs.python.org/reference/expressions.html?highlight=operator%20precedence#summary). – martineau Nov 21 '10 at 18:46
  • @martineau: No. Directly from my interpreter (2.6.5): `3**2*4` gives `36`, and `3**(2*4)` gives `6561`. This is consistent with what the link you provided describes. – Björn Pollex Nov 21 '10 at 19:01
  • @Space_C0wb0y: My mistake, I see now that the table is ordered from lowest to highest precedence, not the other way around (the way I am used to seeing such information presented). Sorry. – martineau Nov 22 '10 at 12:26

2 Answers


The following Cython code (I realize the first line of __init__ is different; I replaced it with a stand-in because I don't know var, and it doesn't matter anyway, since you stated __call__ is the bottleneck):

cdef class SquareErrorDistance:
    cdef double _norm

    def __init__(self, dataSample):
        variance = round(sum(dataSample)/len(dataSample))  # stand-in for var(), as noted above
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, double u, double v): # u and v are floats
        return (u - v) ** 2 * self._norm

Compiled via a simple setup.py (just the example from the docs with the file name altered), it performs nearly 20 times better than the equivalent pure Python in a simple contrived timeit benchmark. Note that the only changes were the cdef declarations for the _norm field and the __call__ parameters. I consider this pretty impressive.
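
For completeness, a minimal setup.py along these lines should do the job (the file name square_error_distance.pyx is an assumption for this sketch):

# setup.py -- minimal build script; the .pyx file name is an assumption
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("square_error_distance.pyx"))

Building with python setup.py build_ext --inplace produces an extension module you can import like any other Python module.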

  • **THIS - IS - AMAZING**. Thank you so much. I can actually apply this (meaning Cython) to many other hotspots as well. You just made my day :) – Björn Pollex Nov 21 '10 at 19:09
  • @Space_C0wb0y: Always glad to help :) If you use numpy heavily, also have a look at http://docs.cython.org/src/tutorial/numpy.html. –  Nov 21 '10 at 19:26
  • You may as well declare variance as a double too. It probably won't make much of a difference, but why not? – Justin Peel Nov 22 '10 at 03:27

This probably won't help much, but you can rewrite it using nested functions:

from numpy import var  # the same variance function used in the question

def SquareErrorDistance(dataSample):
    variance = var(list(dataSample))
    if variance == 0:
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)
        def f(u, v):
            x = u - v
            return x * x * norm
    return f
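
Used like this (made-up sample data), the factory replaces the class directly, since instances of the original class are only ever called:

sample = [1.0, 2.0, 4.0, 4.0]
dist = SquareErrorDistance(sample)  # the factory returns the nested function f
print(dist(1.0, 2.5))               # (1.0 - 2.5)**2 / (2 * var(sample))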

– adw