
EDIT: My original question was poorly phrased, so I deleted it and am rephrasing it entirely here. The tl;dr: I'm trying to assign each computation to a designated worker that fits the computation type. At greater length: I'm trying to run a simulation, which I represent using a class of the form:

class Simulation:
    def __init__(self, first_client: Client, second_client: Client):
        self.first_client = first_client
        self.second_client = second_client

    def first_calculation(self, input):
        with self.first_client.as_current():
            return output  # the first computation happens here

    def second_calculation(self, input):
        with self.second_client.as_current():
            return output  # the second computation happens here

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))

This format has downsides, such as the fact that the simulation object is not pickleable. I could edit the Simulation object to hold only scheduler addresses rather than clients, for example, but I feel there must be a better solution. For instance, I would like the Simulation object to work the following way:

class Simulation:
    def first_calculation(self, input):
        client = dask.distributed.get_client()
        with client.as_current():
            return output
    ...

Thing is, the dask workers best suited for the first calculation are different from the dask workers best suited for the second calculation, which is why my Simulation object has two clients connecting to two different schedulers in the first place. Is there any way to have only one client but two schedulers, and to make it so the client knows to send first_calculation to the first scheduler and second_calculation to the second one?
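For reference, the addresses-only variant I mentioned above might look like the following sketch. The scheduler addresses are made up for illustration; `Client(address)` is assumed to connect to an already-running scheduler at that address:

```python
import pickle


class Simulation:
    """Stores scheduler addresses (plain strings) instead of live Client
    objects, so the Simulation itself stays pickleable."""

    def __init__(self, first_address: str, second_address: str):
        self.first_address = first_address
        self.second_address = second_address

    def _client(self, address):
        # Imported lazily and connected on demand; Client(address)
        # attaches to the already-running scheduler at that address.
        from dask.distributed import Client
        return Client(address)

    def first_calculation(self, input):
        with self._client(self.first_address).as_current():
            ...  # run the first computation here

    def second_calculation(self, input):
        with self._client(self.second_address).as_current():
            ...  # run the second computation here

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))


# The object round-trips through pickle because it holds only strings:
sim = Simulation("tcp://10.0.0.1:8786", "tcp://10.0.0.2:8786")
restored = pickle.loads(pickle.dumps(sim))
```

This works, but it reconnects on every call and still manages two schedulers by hand, which is why I'm hoping for something cleaner.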

  • Welcome to Stack Overflow! See the guide to [ask]. Ideally, create a [mre], but at the very least you need to provide code and the steps to reproduce the issue. It is very difficult to understand from your description exactly what is going on - keep in mind that dask is used in a wide range of deployment configurations, and what may seem obvious to you may not at all be to us! Specifically, please at a minimum include information about your dask deployment (dask-kubernetes? LocalCluster? dask-jobqueue?) and provide your code and full tracebacks for errors you're receiving. – Michael Delgado Dec 05 '21 at 20:22
  • This might be somewhat useful: https://stackoverflow.com/a/66237721/10693596 – SultanOrazbayev Dec 07 '21 at 15:41

2 Answers


Dask will chop up large computations into smaller tasks that can run in parallel. Those tasks are then submitted by the client to the scheduler, which in turn schedules them on the available workers.

Sending the client object to a Dask scheduler will likely not work due to the serialization issue you mention.

You could try one of two approaches:

  • Depending on how you actually run those worker machines, you could specify different types of workers for different tasks. If you run on Kubernetes, for example, you could try to leverage the node pool functionality to make different worker types available.
  • An easier approach using your existing infrastructure would be to bring the result of your first computation back to the machine running the client, using something like .compute(), and then use that data as input for the second computation. In this case you're sending the actual data over the network instead of the client. If the size of that data becomes an issue, you can always write the intermediary results to something like S3.
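The second approach might look like this minimal sketch. Two throwaway in-process local clusters stand in for your two real schedulers, purely for illustration; in practice each `Client` would point at one of your actual scheduler addresses:

```python
from dask.distributed import Client, LocalCluster

# Two in-process clusters standing in for the two real schedulers.
first_client = Client(LocalCluster(n_workers=1, processes=False))
second_client = Client(LocalCluster(n_workers=1, processes=False))

# Run the first computation on the first cluster and pull the result
# back to this machine (the .result() call is what moves the data).
intermediate = first_client.submit(sum, [1, 2, 3]).result()

# Feed the concrete data - not a client - into the second cluster.
final = second_client.submit(lambda x: x * 2, intermediate).result()

first_client.close()
second_client.close()
```

Because only plain data crosses the boundary between the two clusters, nothing unpicklable (like a Client) ever needs to be serialized.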
jtlz2
matthiasdv
  • Not sure if this is a good idea, but instead of serializing the clients, perhaps they could connect using files written by the scheduler... – SultanOrazbayev Dec 07 '21 at 17:49

Dask does support assigning specific tasks to specific workers with annotate. Here's an example snippet, where the delayed_sum task was assigned to one worker and the doubled task to the other. The assert statements check that those workers really were restricted to only those tasks. With annotate you shouldn't need separate clusters. You'll also need the most recent versions of dask and distributed for this to work, because of a recent bug fix.

import distributed
import dask
from dask import delayed

local_cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(local_cluster)

workers = list(client.scheduler_info()['workers'].keys())

# restrict each task to a specific worker address
with dask.annotate(workers=workers[0]):
    delayed_sum = delayed(sum)([1, 2])

with dask.annotate(workers=workers[1]):
    doubled = delayed_sum * 2

# use persist so scheduler doesn't clean up
# wrap in a distributed.wait to make sure they're there when we check the scheduler
distributed.wait([doubled.persist(), delayed_sum.persist()])

worker_restrictions = local_cluster.scheduler.worker_restrictions

assert worker_restrictions[delayed_sum.key] == {workers[0]}
assert worker_restrictions[doubled.key] == {workers[1]}
scj13