
I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; we clearly have a lot of independent sequential loops. A toy problem with classes that mimics the behaviour is this (I'm a mathematician obsessed with p-sums):

class Summer:
    def __init__(self, p):
        self.p = p
    def sum(self):
        # partial sum of the p-series: i**-p for i = 1 .. 999999
        return sum(pow(i, -self.p) for i in range(1, 1000000))

total = sum([Summer(p).sum() for p in range(2,20)])

If I replace the last line with:

from dask.distributed import Client

def psum(p):
    return Summer(p).sum()

client = Client()                       # starts a local cluster using all available cores
A = client.map(psum, range(2, 20))      # one future per exponent p
total = client.submit(sum, A).result()  # reduce on the cluster, fetch the scalar

My runtime is cut by a factor of 4 (the number of cores available on my machine). This ideal behaviour does NOT persist when I use my real classes, which are data-intensive (big pandas structures taking up memory). Is there a recommended alternative to dask.distributed? I'm seeing bad slowdowns, which I attribute to data being passed around.
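
One common remedy, sketched here under assumptions (DataSummer, psum, and the placeholder DataFrame are hypothetical stand-ins for the real data-heavy classes, not the asker's code): if the big pandas structure is shared by all tasks, scatter it to the workers once so each task receives a lightweight reference instead of a freshly serialized copy, and construct the object inside the task so only small arguments cross process boundaries.

import pandas as pd
from dask.distributed import Client

class DataSummer:
    # hypothetical stand-in: the big DataFrame is injected rather than
    # built and held inside the object
    def __init__(self, df, p):
        self.df = df
        self.p = p
    def sum(self):
        return (self.df["x"] ** -self.p).sum()

def psum(df, p):
    # the object is built on the worker; only a reference to the
    # scattered DataFrame and the small integer p are shipped
    return DataSummer(df, p).sum()

if __name__ == "__main__":
    client = Client()
    big_df = pd.DataFrame({"x": range(1, 1000000)}, dtype=float)  # placeholder data
    # scatter ships the DataFrame to every worker once; later tasks
    # refer to it by future instead of re-serializing it per task
    df_future = client.scatter(big_df, broadcast=True)
    futures = [client.submit(psum, df_future, p) for p in range(2, 20)]
    total = client.submit(sum, futures).result()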

Sergio Lucero
  • You may want to read http://distributed.readthedocs.io/en/latest/efficiency.html and http://distributed.readthedocs.io/en/latest/diagnosing-performance.html – MRocklin Dec 19 '17 at 13:46
  • Great, so I've followed your advice and run the profilers (on a sequential version), which show that I am effectively using several gigabytes of memory just building my objects, and it only gets worse after that. My feeble attempt at applying the technique above to the real problem fails because my class generates TypeError: can't pickle thread.lock objects – Sergio Lucero Dec 19 '17 at 16:36
  • This I solved by upgrading dask and pandas and using dask.bag – Sergio Lucero Dec 19 '17 at 16:52
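
For reference, the dask.bag route mentioned in the last comment could look like this on the toy problem (a sketch, not the asker's actual fix; the npartitions value is an arbitrary choice):

import dask.bag as db

class Summer:
    # same toy class as in the question
    def __init__(self, p):
        self.p = p
    def sum(self):
        return sum(pow(i, -self.p) for i in range(1, 1000000))

def psum(p):
    # the object is constructed inside the task, so nothing unpicklable
    # has to cross process boundaries
    return Summer(p).sum()

if __name__ == "__main__":
    # from_sequence partitions the parameter list; map/sum build a lazy
    # graph that compute() runs in parallel on dask's default scheduler
    total = db.from_sequence(range(2, 20), npartitions=4).map(psum).sum().compute()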

0 Answers