
I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit().

I have an existing function, some_func, that grabs individual documents (say, a text file) asynchronously. I want to take the raw document, extract all words that don't contain a vowel, and write them back into a different database. This data processing step is blocking.

Assuming that there are several million documents and the distributed cluster only has 10 nodes with 1 process each (i.e., it can only process 10 documents at a time), how will dask.distributed handle the flow of documents that it needs to process?

Here is some example code:

import distributed

client = distributed.Client('tcp://1.2.3.4:8786')

def some_func():
    doc = retrieve_next_document_asynchronously() 
    client.submit(get_vowelless_words, doc)

def get_vowelless_words(doc):
    vowelless_words = process(doc)
    write_to_database(vowelless_words)

if __name__ == '__main__':
    for i in range(1000000):
        some_func()

Since the processing of a document is blocking and the cluster can only handle 10 documents simultaneously, what happens when 30 other documents are retrieved while the cluster is busy? I understand that client.submit() is asynchronous and that it returns a concurrent future, but what would happen in this case? Would the documents be held in memory until one of the 10 cores becomes available, potentially causing the machine to run out of memory if, say, 1,000 documents are waiting?

What would the scheduler do in this case? FIFO? Should I somehow change the code so that it waits for a core to be available before retrieving the next document? How might that be accomplished?

slaw

2 Answers


To restrict flow using Queues, here is a modified example of using dask Queues with a distributed cluster (based on the documentation):

#!/usr/bin/env python

import distributed
from queue import Queue
from threading import Thread

client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())   # one entry per worker process in the cluster

def increment(x):
    return x+1

def double(x):
    return 2*x

# Bound every stage of the pipeline so at most nprocs items are in flight at once
input_q = Queue(maxsize=nprocs)
remote_q = client.scatter(input_q)
remote_q.maxsize = nprocs
inc_q = client.map(increment, remote_q)
inc_q.maxsize = nprocs
double_q = client.map(double, inc_q)
double_q.maxsize = nprocs
result_q = client.gather(double_q)

def load_data(q):
    # q.put() blocks once the queue is full, so this thread is throttled
    # by how quickly the cluster drains the pipeline
    i = 0
    while True:
        q.put(i)
        i += 1

load_thread = Thread(target=load_data, args=(input_q,))
load_thread.start()

while True:
    size = result_q.qsize()
    item = result_q.get()
    print(item, size)

In this case, we explicitly limit the maximum size of each queue to the number of distributed processes that are available. Otherwise, the while loop would overload the cluster. Of course, you can adjust maxsize to be some multiple of the number of available processes instead. For simple functions like increment and double, I found that maxsize = 10*nprocs is still reasonable, but this will surely be limited by the amount of time it takes to run your custom function.
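To connect this back to the question, here is a minimal, untested sketch of the same back-pressure idea applied to the document pipeline, using the question's retrieve_next_document_asynchronously and get_vowelless_words (assumed to be defined as in the question):

import distributed
from queue import Queue
from threading import Thread

client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())

input_q = Queue(maxsize=nprocs)            # put() blocks once nprocs documents are queued
remote_q = client.scatter(input_q)         # queue of futures living on the cluster
remote_q.maxsize = nprocs
words_q = client.map(get_vowelless_words, remote_q)
words_q.maxsize = nprocs
done_q = client.gather(words_q)

def load_documents(q):
    for _ in range(1000000):
        q.put(retrieve_next_document_asynchronously())   # blocks until a slot frees up

Thread(target=load_documents, args=(input_q,), daemon=True).start()

while True:
    done_q.get()    # drain results so the pipeline keeps moving

Because every queue is bounded, document retrieval stalls whenever the cluster is saturated, so memory use on the client stays proportional to nprocs rather than to the number of documents.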

slaw

When you call submit, all of the arguments are serialized and immediately sent to the scheduler. An alternative would be to retrieve the documents and process them on the cluster as well (this assumes that the documents are globally visible from all workers).

from dask.distributed import fire_and_forget

for fn in filenames:
    doc = client.submit(retrieve_doc, fn)          # fetch the document on a worker
    process = client.submit(process_doc, doc)      # process it where it was fetched
    fire_and_forget(process)                       # don't track the result on the client

If documents are only available on your client machine and you want to restrict flow then you might consider using dask Queues or the as_completed iterator.
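For example, here is a rough sketch (not from the original answer) of throttling with as_completed: seed the cluster with roughly one task per core and submit a new document only when an old one finishes. retrieve_doc and process_doc are the hypothetical helpers named above, and all_filenames is an assumed iterable of document names.

import itertools
from dask.distributed import Client, as_completed

client = Client('tcp://1.2.3.4:8786')
ncores = sum(client.ncores().values())

filenames = iter(all_filenames)
initial = [client.submit(process_doc, client.submit(retrieve_doc, fn))
           for fn in itertools.islice(filenames, ncores)]

ac = as_completed(initial)
for finished in ac:               # yields futures as they finish
    finished.result()             # raises here if the task failed
    try:
        fn = next(filenames)
    except StopIteration:
        continue                  # nothing left to submit; drain what remains
    ac.add(client.submit(process_doc, client.submit(retrieve_doc, fn)))

This keeps at most ncores documents in flight, so client-side memory stays bounded regardless of how many documents remain.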

MRocklin
  • Is `dask.distributed` compatible with asyncio.Queue? The items that I am putting into the queue are HTTP responses coming from aiohttp. While the HTTP requests are asynchronous, I think that the response that is `put` into the `queue.Queue` is synchronous. With aiohttp responses that return a large JSON file, sending this response to a dask worker (via `q.put(response['json'])`) is slow and blocking. Is there a way to make it asynchronous? – slaw Oct 14 '18 at 01:13
  • You can run dask in async mode. It runs on the tornado event loop (which as of today is also the asyncio event loop). See https://distributed.dask.org/en/latest/asynchronous.html – MRocklin Oct 15 '18 at 12:03
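As a minimal sketch of the async mode mentioned in the comment above (the fetch_json coroutine is a hypothetical aiohttp call, not part of the original answer), submitting work and awaiting results can happen entirely on the event loop:

import asyncio
from dask.distributed import Client

async def main():
    async with Client('tcp://1.2.3.4:8786', asynchronous=True) as client:
        payload = await fetch_json()              # hypothetical aiohttp coroutine
        future = client.submit(process, payload)  # process() as in the question
        result = await future                     # awaiting does not block the event loop
        print(result)

asyncio.get_event_loop().run_until_complete(main())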