
I am trying to reduce the computation time of my script, which is run with pypy. It has to calculate, for a large number of lists/vectors/arrays, the pairwise sums of absolute differences. The length of the input vectors is quite small, between 10 and 500. I have tested three different approaches so far:

1) Naive approach, input as lists:

import math
from itertools import izip

def std_sum(v1, v2):
    distance = 0.0
    for (a, b) in izip(v1, v2):
        distance += math.fabs(a - b)
    return distance

2) With lambdas and reduce, input as lists:

lzi = lambda v1, v2: reduce(lambda s, (a, b): s + math.fabs(a - b), izip(v1, v2), 0)

def lmd_sum(v1, v2):
    return lzi(v1, v2)

3) Using numpy, input as numpy.arrays:

import numpy as np

def np_sum(v1, v2):
    return np.sum(np.abs(v1 - v2))

On my machine, using pypy and pairs from itertools.combinations_with_replacement of 500 such lists, the first two approaches are very similar (roughly 5 seconds), while the numpy approach is significantly slower, taking around 12 seconds.
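The timing loop is essentially of this form (a simplified sketch; `vectors` here stands for the 500 parsed lists, and the exact setup is in the linked script):

import time
from itertools import combinations_with_replacement

def benchmark(dist_func, vectors):
    # time dist_func over all unordered pairs (including self-pairs)
    start = time.time()
    total = 0.0
    for v1, v2 in combinations_with_replacement(vectors, 2):
        total += dist_func(v1, v2)
    return time.time() - start, total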

Is there a faster way to do the calculations? The lists are read and parsed from text files and an increased preprocessing time would be no problem (such as creating numpy arrays). The lists contain floating point numbers and are of equal size which is known beforehand.

The script I use for "benchmarking" can be found here, and some example data here.

feob
  • Are you using NumPyPy? – Veedrac Jun 01 '14 at 18:43
  • To be honest, I'm not sure, I have to investigate. However, the performance difference is also present when I run the 'benchmark' with standard python. – feob Jun 01 '14 at 18:46
  • Can you post your timing code? Can you also try `sum(abs(a-b) for a, b in izip(v1, v2))`? I have a feeling that the problem could well be using `itertools` instead of a pure-Numpy solution. – Veedrac Jun 01 '14 at 18:51
  • When I launch the pypy shell, import numpy and do numpy.__file__ it does point to the pypy-numpy fork I installed from https://bitbucket.org/pypy/numpy, so I guess I am really using numpypy. – feob Jun 01 '14 at 19:07
  • Just have a look at this [benchmark](http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb?create=1). I know it is not pypy, but those JIT compilers are especially made for numerical calculations... So maybe you can get your speedup using numba, parakeet or numexpr... – koffein Jun 01 '14 at 19:17

1 Answer


Is there a faster way to do the calculations? The lists are read and parsed from text files and an increased preprocessing time would be no problem (such as creating numpy arrays). The lists contain floating point numbers and are of equal size which is known beforehand.

PyPy is very good at optimizing list accesses, so you should probably stick to using lists.

One thing that helps PyPy optimize is to make sure your lists always contain only one type of object. That is, if you read strings from a file, don't put the strings in a list and then parse them into floats in place. Instead, build the list with floats from the start, for example by parsing each string as soon as it is read. Likewise, never preallocate a list, especially with `[None,]*N`, or PyPy will not be able to infer that all the elements have the same type.
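A minimal sketch of that parsing pattern (the file name and the comma-separated format are assumptions, matching what you describe in the comments):

def read_vectors(path):
    # build each list with floats from the start, so PyPy sees a
    # homogeneous list of floats rather than a list of strings
    vectors = []
    with open(path) as f:
        for line in f:
            vectors.append([float(x) for x in line.strip().split(',')])
    return vectors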

Second, iterate over the lists as few times as possible. Your np_sum function walks both arrays three times (subtract, abs, sum) unless PyPy notices and optimizes that away. Approaches 1) and 2) walk the lists only once, so they are faster.
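For comparison, here is a single-pass variant along the lines suggested in the comments, using a generator expression instead of an explicit loop; whether it beats the plain loop under PyPy is something you would have to measure:

from itertools import izip

def gen_sum(v1, v2):
    # walks both lists exactly once; abs() is fine for plain floats
    return sum(abs(a - b) for a, b in izip(v1, v2))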

otus
  • In my actual script, I parse the list like this: `[float(d) for d in data.split(',') ]` Is this sufficient? – feob Jun 01 '14 at 19:17
  • @feob, yes, that should do it. Something like `s = data.split(); for i in range(len(s)): s[i] = float(s[i])` *seems* like it would be an optimization, but would actually be a pessimization. – otus Jun 01 '14 at 19:55