
TL;DR
I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up.

Background
I'm using Dask in a real-time query fashion with a smaller-than-memory dataset. Because it's real-time, it's important that the workers can trust that the scheduler always has certain datasets available, even immediately after startup. The workers hold the entire dataset in memory at all times.

Traditionally I've done this by connecting a client, persisting a DataFrame and publishing it as a dataset:

import dask.dataframe as dd
from dask.distributed import Client

client = Client('127.0.0.1:8786')
df = dd.read_parquet('df.parq')
df = client.persist(df)
client.publish_dataset(flights=df)

But this leaves open the possibility of the scheduler restarting and the dataset not being reloaded.

I know that you can use --preload to execute a script on startup, like so:

dask-scheduler --preload=scheduler-startup.py

And that the boilerplate code looks like this:

from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)

But how can I convince the scheduler to load my dataset without using an external client?

In theory I could spawn a subprocess that starts up a client to prepopulate the dataset, but it feels less than ideal :)
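For completeness, a minimal sketch of what that subprocess fallback could look like; populate.py is a hypothetical script containing the read_parquet/persist/publish_dataset code from above:

import subprocess

def dask_setup(scheduler):
    # Hypothetical fallback: run the client code in a separate process
    # so it can't block the scheduler's event loop. populate.py would
    # hold the persist + publish_dataset code shown earlier.
    subprocess.Popen(['python', 'populate.py'])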

Normal client in scheduler startup
Trying to connect a client from within the scheduler startup script:

import dask.dataframe as dd
from dask.distributed import Client
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    c = Client(scheduler.address)
    df = dd.read_parquet('df.parq')
    df = c.persist(df)
    c.publish_dataset(flights=df)

This hangs at c = Client(scheduler.address) and has to be force-killed (kill -9).

Niklas B
  • What happens if you put the client code that you posted in your startup script? – MRocklin Sep 28 '17 at 14:17
  • It hangs indefinitely (probably in a recursive loop where the client tries to connect to the scheduler, which is not yet started). The hung process can't be Ctrl-C'ed but has to be kill -9'ed – Niklas B Sep 28 '17 at 18:21

2 Answers


You might consider putting your client code in an asynchronous function that runs on the event loop. This allows the preload script to finish and the scheduler to start up before your client code runs. You might want something like the following:

import dask.dataframe as dd
from dask.distributed import Client

async def f(scheduler):
    # asynchronous=True lets the client be awaited on the running loop
    client = await Client(scheduler.address, asynchronous=True)
    df = dd.read_parquet(...)
    await client.publish_dataset(flights=df)

def dask_setup(scheduler):
    # Only schedules f; it runs once the scheduler's event loop starts
    scheduler.loop.add_callback(f, scheduler)
MRocklin

@MRocklin's answer got me on the right path; I did, however, need to drop into a separate thread:

import dask.dataframe as dd
from concurrent.futures import ThreadPoolExecutor
from dask.distributed import Client

def load_dataset():
    # Runs in its own thread, so a blocking (synchronous) client is fine
    client = Client('127.0.0.1:8786')
    df = dd.read_parquet(...)
    df = client.persist(df)
    client.publish_dataset(flights=df)

async def f(scheduler):
    # Hand the blocking work off to a thread so the event loop stays free
    executor = ThreadPoolExecutor(max_workers=1)
    executor.submit(load_dataset)

def dask_setup(scheduler):
    scheduler.loop.add_callback(f, scheduler)

The downside is that it doesn't stop the workers from connecting while the data is loading, but I think that will have to be managed on the worker side (retry if the dataset is not yet available).
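For example, a minimal retry sketch on the consuming side might look like this; the address, dataset name, and retry parameters are assumptions:

import time

from dask.distributed import Client

def get_dataset_with_retry(name, address='127.0.0.1:8786', retries=30, delay=1.0):
    # Poll the scheduler until the published dataset appears
    client = Client(address)
    for _ in range(retries):
        try:
            # Client.get_dataset raises KeyError while the name is unpublished
            return client.get_dataset(name)
        except KeyError:
            time.sleep(delay)
    raise TimeoutError(f"Dataset {name!r} never appeared")

df = get_dataset_with_retry('flights')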

Niklas B