I've got a function that generates an image and saves it to disk. The function takes no arguments:

def generate_and_save():
    pass # generate and store image
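
For experimenting, a minimal stand-in for this function could write a tiny ASCII PGM image using only the standard library. This is just a sketch for testing the pipeline; the `out_dir` parameter and the PGM format are assumptions, not part of the original code:

```python
import os
import tempfile
import uuid

def generate_and_save(out_dir=None):
    """Stand-in image generator: writes a 64x64 grayscale gradient
    as a plain-text (P2) PGM file and returns its path.
    Note: out_dir is a hypothetical parameter for testability."""
    out_dir = out_dir or tempfile.gettempdir()
    path = os.path.join(out_dir, f"img_{uuid.uuid4().hex}.pgm")
    width = height = 64
    # One row of ASCII gray values per image row, values in 0..255
    rows = [" ".join(str((x + y) % 256) for x in range(width))
            for y in range(height)]
    with open(path, "w") as f:
        f.write(f"P2\n{width} {height}\n255\n" + "\n".join(rows) + "\n")
    return path
```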

I need to generate a large number of images (say 100k), so I opted for Dask. Having read the documentation, I've put together a function that creates a local distributed client and executes the tasks across several processes, like so:

from dask.distributed import Client as DaskClient

def generate_images(how_many=10000, processes=6):
    # start Dask distributed client with 1 thread per process
    client = DaskClient(n_workers=processes, threads_per_worker=1)
    # submit future functions to cluster
    futures = []
    for i in range(how_many): 
        futures.append(client.submit(generate_and_save, pure=False))
    # execute and compute results (synchronous / blocking!)
    results = client.gather(futures)
    print(len(results))
    # stop & release client
    client.close()

generate_images(50000)

As you can see, the futures are submitted to the scheduler in a for loop and stored in a plain list. The question is: is there a more efficient way of adding and executing the futures in this case, e.g. by parallelizing the submission procedure itself?

  • How long does a one-shot call of **`generate_and_save()`** take in **`[us]`**, and how many **`[MB]`** get written, on what kind of storage? You may use `from zmq import Stopwatch; aClk = Stopwatch(); aClk.start(); generate_and_save(); aClk.stop()`, which returns the call duration in `[us]` (better run as a battery test of ~1k samples, posting (min, avg, max, stdev) in `[us]` to better reflect the nature of the workload and discriminate outliers; a complete L2/L3-cache eviction can also be enforced inside such a testing template). – user3666197 Sep 13 '19 at 04:42
  • Haven't tested that, but I can say offhand that one run of the worker takes around 0.1 sec, writing an image of around 200 kB. Is that relevant to my question? – s0mbre Sep 14 '19 at 10:50
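
The battery test suggested in the first comment can be reproduced with the standard library alone, without `zmq.Stopwatch`. A sketch; `time_calls` is a hypothetical helper, and `fn` stands in for `generate_and_save`:

```python
import statistics
import time

def time_calls(fn, n=1000):
    """Time n one-shot calls of fn and return (min, avg, max, stdev)
    in microseconds, so outliers are visible rather than averaged away."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e6)  # seconds -> us
    return (min(samples), statistics.mean(samples),
            max(samples), statistics.stdev(samples))

# Example: lo, avg, hi, sd = time_calls(generate_and_save, n=1000)
```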

1 Answer


Nope, this looks pretty good. I wouldn't expect the submission overhead to take too long, probably somewhere under 1 ms per task, so on the order of 10 s in total.

If this overhead does turn out to be significant, then you might want to read this doc section: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
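
The usual remedy from that doc section is to batch many image generations into a single task, so per-task overhead is amortized. A sketch under the assumption that `generate_and_save` is the question's function; `make_batches` and `generate_batch` are hypothetical helpers:

```python
def make_batches(total, batch_size):
    """Split `total` tasks into contiguous batches of at most `batch_size`."""
    return [range(i, min(i + batch_size, total))
            for i in range(0, total, batch_size)]

def generate_batch(batch, fn):
    """One Dask task that runs fn once per item in the batch,
    amortizing scheduler overhead across many image generations."""
    for _ in batch:
        fn()
    return len(batch)

# Usage with Dask (sketch, assuming the question's generate_and_save):
#   from dask.distributed import Client as DaskClient
#   client = DaskClient(n_workers=6, threads_per_worker=1)
#   futures = [client.submit(generate_batch, b, generate_and_save, pure=False)
#              for b in make_batches(50_000, 500)]
#   total = sum(client.gather(futures))
#   client.close()
```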

MRocklin