
I am trying to build a sampler for my Markov chain Monte Carlo code using PyMC. For each set of sampled model parameters, the model output is built by calling the getLensing method of an nfw class instance and comparing it to the observed data. My problem is that computing the model output is very slow: I have about 24,000 data points, and for each of them a probability distribution (obj_pdf) that I marginalize over (integrate) in the inner loop, so each evaluation takes at least an hour.

import numpy as np
z=np.arange(0,1.5,0.001)
z_h=0.15
for j in range(pos.shape[0]):
    value1=0;value2=0
    pdf=obj_pdf[j,:]/sum(obj_pdf[j,:])
    for i in range(len(z)):
        if (z[i]>z_h):
            g1,g2=nfw.getLensing( pos[j,:], z[i])
            value1+=g1*pdf[i]
            value2+=g2*pdf[i]
    if (j<1):
        value=np.array([value1,value2])
    else:
        value=np.vstack((value, np.array([value1,value2])))

So if I want to re-sample the input parameters, say, 100,000 times, the MCMC calculation would take months. Is there any smart way to speed up my code and loops? Would something like numpy.vectorize help, or would it not improve the speed at all? How about Cython, would it increase the performance of the code, and if so, how does it work?
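To make the numpy.vectorize part of my question concrete, this is roughly the rewrite I am imagining. It is only a sketch: lensing_vec stands for a hypothetical version of getLensing that accepts a whole array of redshifts at once, which my nfw class does not currently provide.

import numpy as np

z_h = 0.15
z = np.arange(0, 1.5, 0.001)
mask = z > z_h                                  # only redshifts behind the halo contribute

def model_values(pos, obj_pdf, lensing_vec):
    # lensing_vec(posj, z_array) is a hypothetical vectorized getLensing
    # that returns g1 and g2 as arrays over z_array
    values = np.empty((pos.shape[0], 2))
    for j in range(pos.shape[0]):
        pdf = obj_pdf[j, :] / obj_pdf[j, :].sum()
        g1, g2 = lensing_vec(pos[j, :], z[mask])
        values[j, 0] = np.dot(g1, pdf[mask])    # marginalize over the pdf
        values[j, 1] = np.dot(g2, pdf[mask])
    return values

Would rewriting getLensing so that it can work on arrays like this actually pay off, or is the time all spent inside the integrals anyway?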

I ran python -m cProfile mycode.py to see what was making my code slow, and these are the results:

    12071    0.004    0.000    0.004    0.000 {min}
        2    0.000    0.000    0.000    0.000 {next}
        1    0.000    0.000    0.000    0.000 {numexpr.interpreter._set_num_threads}
        8    0.002    0.000    0.002    0.000 {numpy.core.multiarray.arange}
132424695  312.210    0.000  312.210    0.000 {numpy.core.multiarray.array}
    73498    3.933    0.000    3.933    0.000 {numpy.core.multiarray.concatenate}
 99151506  201.497    0.000  201.497    0.000 {numpy.core.multiarray.copyto}
 99151500  164.303    0.000  164.303    0.000 {numpy.core.multiarray.empty_like}
       28    0.000    0.000    0.000    0.000 {numpy.core.multiarray.empty}
        2    0.000    0.000    0.000    0.000 {numpy.core.multiarray.set_string_function}
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.set_typeDict}
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.where}
       14    0.000    0.000    0.000    0.000 {numpy.core.multiarray.zeros}
       14    0.000    0.000    0.000    0.000 {numpy.core.umath.geterrobj}
        7    0.000    0.000    0.000    0.000 {numpy.core.umath.seterrobj}
      270    0.000    0.000    0.000    0.000 {numpy.lib._compiled_base.add_docstring}
        6    0.011    0.002    0.011    0.002 {open}
        1    0.000    0.000    0.000    0.000 {operator.div}
        2    0.000    0.000    0.000    0.000 {operator.mul}
     1918    0.000    0.000    0.000    0.000 {ord}
        2    0.000    0.000    0.000    0.000 {posix.WEXITSTATUS}
        2    0.000    0.000    0.000    0.000 {posix.WIFEXITED}
        1    0.000    0.000    0.000    0.000 {posix.WIFSIGNALED}
        9    0.002    0.000    0.002    0.000 {posix.access}
        3    0.000    0.000    0.000    0.000 {posix.close}
        5    0.002    0.000    0.002    0.000 {posix.fdopen}
        1    0.002    0.002    0.002    0.002 {posix.fork}
        4    0.000    0.000    0.000    0.000 {posix.getcwd}
        6    0.000    0.000    0.000    0.000 {posix.getpid}
        1    0.000    0.000    0.000    0.000 {posix.getuid}
        1    0.000    0.000    0.000    0.000 {posix.listdir}
        6    0.000    0.000    0.000    0.000 {posix.lstat}
        4    0.043    0.011    0.043    0.011 {posix.open}
        2    0.000    0.000    0.000    0.000 {posix.pipe}
        2    0.004    0.002    0.004    0.002 {posix.popen}
        1    0.007    0.007    0.007    0.007 {posix.read}
      205    0.059    0.000    0.059    0.000 {posix.stat}
        3    0.000    0.000    0.000    0.000 {posix.sysconf}
        2    0.000    0.000    0.000    0.000 {posix.uname}
        4    0.004    0.001    0.004    0.001 {posix.unlink}
        3    0.000    0.000    0.000    0.000 {posix.urandom}
        1    0.000    0.000    0.000    0.000 {posix.waitpid}
        1    0.000    0.000    0.000    0.000 {pow}
     1522    0.004    0.000    0.004    0.000 {range}
       73    0.000    0.000    0.000    0.000 {repr}
 99151501 2102.879    0.000 6380.906    0.000 {scipy.integrate._quadpack._qagse}
     1776    0.002    0.000    0.002    0.000 {setattr}
       32    0.000    0.000    0.000    0.000 {sorted}
    24500   18.861    0.001   18.861    0.001 {sum}
      184    0.000    0.000    0.000    0.000 {sys._getframe}
        1    0.000    0.000    0.000    0.000 {sys.getfilesystemencoding}
        2    0.000    0.000    0.000    0.000 {sys.settrace}
        1    0.000    0.000    0.000    0.000 {tables.utilsextension._broken_hdf5_long_double}
        1    0.000    0.000    0.000    0.000 {tables.utilsextension.blosc_compressor_list}
        2    0.000    0.000    0.000    0.000 {tables.utilsextension.get_hdf5_version}
        1    0.000    0.000    0.000    0.000 {tables.utilsextension.get_pytables_version}
        2    0.000    0.000    0.000    0.000 {tables.utilsextension.which_lib_version}
       27    0.000    0.000    0.000    0.000 {thread.allocate_lock}
        6    0.000    0.000    0.000    0.000 {thread.get_ident}
        4    0.000    0.000    0.000    0.000 {thread.start_new_thread}
        1    0.000    0.000    0.000    0.000 {time.localtime}
        2    0.000    0.000    0.000    0.000 {time.time}
      105    0.000    0.000    0.000    0.000 {unichr}
      229    0.000    0.000    0.000    0.000 {vars}
    49300    2.127    0.000    2.127    0.000 {zip}
  • Is your code spending most of the time in one of the functions called in the loop or really by executing the loop instructions? If it is the former, then Python's loop performance is not your bottleneck. In order to properly identify your bottleneck, you need to profile your code and see in which of the inner functions most of the time is spent. – Dr. Jan-Philip Gehrcke Jun 05 '14 at 22:30
  • Run the profiler on your code to see `tottime` of your functions. `python -m cProfile myprogram.py` – woot Jun 05 '14 at 22:31
  • @Jan-PhilipGehrcke I think my code spends most of its time integrating over the **pdf**, which makes it very slow. However, the other function that is called belongs to a separate class and is also called in the inner loop. – Dalek Jun 05 '14 at 22:34
  • @Jan-PhilipGehrcke Would substituting the inner loop with, for example, `numpy.vectorize` increase the speed of the code? – Dalek Jun 05 '14 at 23:07
  • What are the dimensions of your arrays? What is pos.shape, what is obj_pdf.shape? – Mathias Jun 06 '14 at 07:14
  • @Mathias `obj_pdf.shape` is (24500,1500) and `pos.shape` is (1,2). – Dalek Jun 06 '14 at 07:20
  • I assume those are only test-case dimensions? I expected pos.shape[0] to be equal to obj_pdf.shape[0]. With pos.shape[0]==1 you can remove the first loop, although that doesn't increase the speed either. I guess using multiple cores is the easiest way to get a ~4x speedup (if you have a quadcore). – Mathias Jun 06 '14 at 08:08
  • @Mathias you are right. The whole shape of the `pos` array is (24500,2). – Dalek Jun 06 '14 at 08:16
  • Profile!! Don't say you think it spends most of your time in X, just run your code in a profiler! This is a cheap and easy way to understand where the bottlenecks actually are. Follow @woot's instructions. – nneonneo Jun 06 '14 at 15:39
  • @nneonneo I ran cProfile on my code; the output is given in the edited version of my question. – Dalek Jun 08 '14 at 16:17
  • @Dalek: The obvious culprit is `_qagse`, which is called by `integrate.quad`. Since that's not in your code, I have to assume it's in PyMC, which means the only way you are going to improve your performance significantly is to not call PyMC so often. – nneonneo Jun 08 '14 at 16:41

1 Answer


Here is some code. I'd be amazed if the timings went from 60 to 59 minutes though.

import numpy as np
z_h=0.15
z_full=np.arange(0, 1.5, 0.001) # the grid obj_pdf is tabulated on
keep=z_full>z_h # precompute the test (z[i]>z_h) once instead of inside the loop
z=z_full[keep]

value=np.array([])

for j in range(pos.shape[0]):
    value1=0;value2=0
    pdf=obj_pdf[j,:]/sum(obj_pdf[j,:])
    pdf=pdf[keep] # keep the pdf aligned with the truncated z grid
    posj=pos[j,:] # precalculate
    for i,zi in enumerate(z): # use enumerate if you need value and index
        g1,g2=nfw.getLensing( posj, zi)
        value1+=g1*pdf[i]
        value2+=g2*pdf[i]
    value=np.append(value, np.array([value1,value2])) # use a proper append function

value=value.reshape(-1,2) # np.append flattens, so recover the (N,2) shape at the end

Like the others, I assume getLensing is eating up your CPU cycles.

According to the first answer to this, np.vectorize will not speed up your function.
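Since your profile shows that nearly all of the time is spent in scipy.integrate._qagse (presumably quad calls made somewhere inside getLensing or the libraries it uses), the other cheap win is the one I mentioned in the comments: spread the outer loop over several cores. Below is a rough, untested sketch using multiprocessing; it assumes nfw is a module-level object that the worker processes can use (on Linux the forked workers simply inherit it).

import numpy as np
from multiprocessing import Pool

z_h = 0.15
z_full = np.arange(0, 1.5, 0.001)   # the grid obj_pdf is tabulated on
keep = z_full > z_h
z = z_full[keep]

def one_source(args):
    # integrate g1 and g2 against the (normalized, truncated) pdf of one source
    posj, pdf = args
    value1 = 0.0; value2 = 0.0
    for i, zi in enumerate(z):
        g1, g2 = nfw.getLensing(posj, zi)
        value1 += g1 * pdf[i]
        value2 += g2 * pdf[i]
    return value1, value2

def all_sources(pos, obj_pdf, nproc=4):
    args = [(pos[j, :], (obj_pdf[j, :] / obj_pdf[j, :].sum())[keep])
            for j in range(pos.shape[0])]
    pool = Pool(nproc)
    try:
        results = pool.map(one_source, args)    # one source per task
    finally:
        pool.close()
        pool.join()
    return np.array(results)                    # shape (N, 2), like the vstack version

With 24500 sources and a quad-core machine this should get you close to the ~4x speedup mentioned above, but it does not change the fact that every getLensing call still does a numerical integration; if months need to become days, that is where the real work is.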
