TL;DR
I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up.
Background
I'm using Dask in a realtime query fashion with a smaller-than-memory dataset. Because it's realtime, it's important that the workers can trust that the scheduler always has certain datasets available, even immediately after startup. The workers hold the entire dataset in memory at all times.
Traditionally I've done this by connecting a client, persisting a dataframe and publishing it as a dataset:
import dask.dataframe as dd
from dask.distributed import Client

client = Client('scheduler-address:8786')  # connect to the running scheduler
df = dd.read_parquet('df.parq')
df = client.persist(df)
client.publish_dataset(flights=df)
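For completeness, this is how the consumers then pick the dataset up again; a minimal sketch of the consumer side, assuming the dataset was published under the name flights as above (the scheduler address is a placeholder):

from dask.distributed import Client

# Any client connected to the same scheduler can look up the published name
# and get back handles to the already-persisted dataframe.
client = Client('scheduler-address:8786')
df = client.get_dataset('flights')
print(df.head())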
But this leaves open the possibility that the scheduler restarts and the dataset is no longer loaded.
I know that you can use --preload
to execute a script on startup, like so:
dask-scheduler --preload=scheduler-startup.py
And that the boilerplate code looks like this:
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)
But how can I convince the scheduler to load my dataset without using an external client?
In theory I could spawn a subprocess that starts up a client and prepopulates the dataset, but it feels less than ideal :)
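To make the subprocess idea concrete, something like this rough sketch is what I have in mind (prepopulate.py is a hypothetical helper script that would just run the read_parquet / persist / publish_dataset steps from above):

import subprocess
import sys

def dask_setup(scheduler):
    # Run the client-based prepopulation in a separate process so the blocking
    # Client(...) call can't stall the scheduler itself. The helper may need to
    # retry until the scheduler actually accepts connections.
    subprocess.Popen([sys.executable, 'prepopulate.py', scheduler.address])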
Normal client in scheduler startup
Trying to connect as a client inside the scheduler startup script:
from distributed.diagnostics.plugin import SchedulerPlugin
from dask.distributed import Client
import dask.dataframe as dd

class MyPlugin(SchedulerPlugin):
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    c = Client(scheduler.address)
    df = dd.read_parquet('df.parq')
    df = c.persist(df)
    c.publish_dataset(flights=df)
This hangs at c = Client(scheduler.address) and has to be force-killed (kill -9).