
Consider the following class:

from numpy import var  # variance function from numpy (see the comments below)

class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = var(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm

I use it to calculate the distance between two elements of a vector. I basically create one instance of that class for every dimension of the vector that uses this distance measure (there are dimensions that use other distance measures). Profiling reveals that the __call__ method of this class accounts for 90% of the running time of my knn implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe it would help if I implemented it in C?
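
To give a bit more context, I wire these up roughly like the sketch below (the names and the sample data here are made up for illustration; my real code differs):

# Illustrative only: one distance callable per dimension, results summed.
training_columns = [[1.0, 2.0, 4.0], [0.5, 0.5, 1.5]]  # made-up sample values per dimension
distance_functions = [SquareErrorDistance(column) for column in training_columns]

def vector_distance(a, b):
    # a and b are vectors; in my real code some dimensions use other distance measures
    return sum(dist(x, y) for dist, x, y in zip(distance_functions, a, b))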

If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes to call a C function that does the computation, but apparently the conversion of the parameters and return values is far too expensive, because the resulting code is much slower.
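
For reference, my ctypes attempt looked roughly like this (the library and function names below are placeholders, not my actual code):

import ctypes

# Hypothetical shared library exporting:
#   double square_error(double u, double v, double norm) { return (u - v) * (u - v) * norm; }
lib = ctypes.CDLL("./libsquareerror.so")
lib.square_error.argtypes = [ctypes.c_double, ctypes.c_double, ctypes.c_double]
lib.square_error.restype = ctypes.c_double

# Every call converts three Python floats to c_double and converts the result back,
# which dominates when the computation itself is this cheap.
result = lib.square_error(1.0, 2.5, 0.125)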

I could of course implement the entire knn in C and just call that, but the problem is that, as described above, I use different distance functions for some dimensions of the vectors, and translating all of these to C would be too much work.

So what are my alternatives? Will writing the C function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?

  • I would suggest Cython (an answer with an example implementation might follow in a few minutes). I assume your algorithms are already as tuned as reasonably possible? –  Nov 21 '10 at 18:09
  • @delnan: I already use caching where possible and appropriate, so I do not see any ways of saving distance computations. – Björn Pollex Nov 21 '10 at 18:17
  • Well then... unrelated, what's `dataSample` and `var`? –  Nov 21 '10 at 18:21
  • @delnan: `dataSample` is a list of floats, and `var` is the variance function from numpy. – Björn Pollex Nov 21 '10 at 18:36
  • A little off-topic: You do realize that the expression `__call__()` returns is being calculated as though written like this `(u - v) ** (2 * self._norm)`? See the operator precedence table [here](http://docs.python.org/reference/expressions.html?highlight=operator%20precedence#summary). – martineau Nov 21 '10 at 18:46
  • @martineau: No. Directly from my interpreter (2.6.5): `3**2*4` gives `36`, and `3**(2*4)` gives `6561`. This is consistent with what the link you provided describes. – Björn Pollex Nov 21 '10 at 19:01
  • @Space_C0wb0y: My mistake, I see now that the table is ordered from lowest to highest precedence, not the other way around (the way I am used to seeing such information presented). Sorry. – martineau Nov 22 '10 at 12:26

2 Answers


The following Cython code (I realize the first line of __init__ is different; I replaced it with a stand-in because I don't know var, and it doesn't matter anyway, since you stated __call__ is the bottleneck):

cdef class SquareErrorDistance:
    cdef double _norm

    def __init__(self, dataSample):
        variance = round(sum(dataSample)/len(dataSample))  # stand-in for var(), as noted above
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, double u, double v): # u and v are floats
        return (u - v) ** 2 * self._norm

Compiled via a simple setup.py (just the example from the docs with the file name altered), it performs nearly 20 times better than the equivalent pure Python in a simple contrived timeit benchmark. Note that the only changes were the cdef declarations for the _norm field and the __call__ parameters. I consider this pretty impressive.
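
For completeness, a minimal setup.py along these lines should do the job (the file name square_error_distance.pyx is an assumption for this sketch):

# setup.py -- minimal build script; the .pyx file name is an assumption
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("square_error_distance.pyx"))

Building with python setup.py build_ext --inplace produces an extension module you can import like any other Python module.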

  • **THIS - IS - AMAZING**. Thank you so much. I can actually apply this (meaning Cython) to many other hotspots as well. You just made my day :) – Björn Pollex Nov 21 '10 at 19:09
  • @Space_C0wb0y: Always glad to help :) If you use numpy heavily, also have a look at http://docs.cython.org/src/tutorial/numpy.html. –  Nov 21 '10 at 19:26
  • You may as well declare variance as a double too. It probably won't make much of a difference, but why not? – Justin Peel Nov 22 '10 at 03:27

This probably won't help much, but you can rewrite it using nested functions:

from numpy import var  # the same variance function used in the question

def SquareErrorDistance(dataSample):
    variance = var(list(dataSample))
    if variance == 0:
        def f(u, v):
            x = u - v
            return x * x
    else:
        norm = 1.0 / (2 * variance)
        def f(u, v):
            x = u - v
            return x * x * norm
    return f
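
Used like this (made-up sample data), the factory replaces the class directly, since instances of the original class are only ever called:

sample = [1.0, 2.0, 4.0, 4.0]
dist = SquareErrorDistance(sample)  # the factory returns the nested function f
print(dist(1.0, 2.5))               # (1.0 - 2.5)**2 / (2 * var(sample))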

– adw