
I have the following code that runs two TensorFlow trainings in parallel using Dask workers running in Docker containers.

I need to launch two processes, using the same dask client, where each will train their respective models with N workers.

To that end, I do the following:

  • I use joblib.delayed to spawn the two processes.
  • Within each process I run `with joblib.parallel_backend('dask'):` to execute the fit/training logic. Each training process triggers N dask workers.

The problem is that I don't know whether the entire process is thread-safe. Are there any concurrency issues that I'm missing?

# First, submit the function twice using joblib.delayed
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)

# Second, submit each training process
def train(sub_task):

    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train

    # Third, process the training itself with N workers
    with joblib.parallel_backend('dask'):
        X = data[columns] 
        y = data[label]

        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)
ps0604
  • Your question is rather open-ended. Do you have something more specific you have a problem with? Also, you mention launching processes, and then ask about thread-safety: is your dask worker or anything else using multiple threads? – mdurant Dec 27 '21 at 02:33
  • I'm just trying to check if there are any race conditions in the code I posted. As @Sultan mentioned there may be a race condition when the client is created. I tried to create the client just once outside of the `train` function and pass it as a parameter, but I get an error saying that `joblib.parallel_backend('dask')` doesn't have a client defined – ps0604 Dec 27 '21 at 04:32

2 Answers


This is pure speculation, but one potential concurrency issue stems from the `if client is None:` check, where two processes could race to create a `Client`.

If this is resolved (e.g. by explicitly creating a client in advance), then the dask scheduler will prioritize tasks based on their time of submission (unless a priority is explicitly assigned) and on the graph (DAG) structure; further details are available in the docs.
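One hedged way to sketch that fix: create the client once, up front, and pass its scheduler address to each training process. The in-process cluster (`processes=False`) and the `train` signature below are illustrative assumptions, not the original Docker setup:

```python
from dask.distributed import Client

# Create the client once, before any training process starts, so the
# `if client is None:` check-then-create race can never happen.
# processes=False gives a lightweight in-process cluster for this sketch;
# in the real setup this would point at the Docker-based cluster.
client = Client(processes=False)
address = client.scheduler.address

def train(sub_task, address):
    # Each training process connects to the one shared scheduler
    # by address instead of creating its own cluster.
    worker_client = Client(address)
    return worker_client.scheduler.address

# Both trainings end up talking to the same scheduler:
print(train(123, address) == address)  # True
```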

SultanOrazbayev
  • I moved the client creation logic outside the train method, and I get `ValueError: To use Joblib with Dask first create a Dask Client`, how to create the client in advance? – ps0604 Dec 24 '21 at 11:00
  • Hmm, according to the [example in the docs](https://joblib.readthedocs.io/en/latest/auto_examples/parallel/distributed_backend_simple.html#sphx-glr-auto-examples-parallel-distributed-backend-simple-py) `with joblib.parallel_backend('dask'):` should be outside the call to `joblib.Parallel`... – SultanOrazbayev Dec 24 '21 at 11:38
  • Sultan, your example works for a single process, I need to launch two processes, using the same dask client, where each will train their respective models with N workers. – ps0604 Dec 24 '21 at 12:11

The question, as given, could easily be marked as "unclear" for SO. A couple of notes:

  • `global client`: makes the client object available outside of the function. But the function runs in another process, so creating the client there does not affect the other process.
  • `if client is None`: this raises a `NameError`; the code doesn't actually run as written.
  • `client = Client()`: you create a new cluster in each subprocess, each assuming it has the machine's total resources available, oversubscribing those resources.
  • dask knows whether any client has been created in the current process, but that doesn't help you here

You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out the parallelism, which is what it's designed for?
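That suggestion can be sketched without any joblib process pool at all: submit both fits to a single client and let the scheduler run them concurrently. The trivial `train` stand-in and the in-process cluster here are assumptions for illustration:

```python
from dask.distributed import Client

def train(sub_task):
    # Stand-in for the real model fit; just echoes a result.
    return sub_task * 2

client = Client(processes=False)  # in-process cluster for the sketch

# Two trainings, one client, no joblib.Parallel needed: the scheduler
# runs the submitted tasks concurrently across its workers.
futures = [client.submit(train, t) for t in (123, 456)]
print(client.gather(futures))  # [246, 912]
```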

--

-EDIT-

To answer the form of the question asked in a comment:

My question is whether using the same client variable in these two parallel processes creates a problem.

No, the two client variables are unrelated to one another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make the client global, as this is unnecessary and makes what you are doing less clear.

--

I think I must answer the question as phrased in your comment, which I advise adding to the main question:

I need to launch two processes, using the same dask client, where each will train their respective models with N workers.

You have the following options:

  • create a client with a specific known address within your program or beforehand, then connect to it
  • create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
  • write a scheduler information file with client.write_scheduler_file and use that

You will connect in the function with

client = Client(address)

or

client = Client(scheduler_file=the_file_you_wrote)
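The scheduler-file route (the third option above) might look like this sketch; the filename `scheduler.json` and the in-process cluster are assumptions for illustration:

```python
from dask.distributed import Client

# Main program: create the cluster and persist the scheduler's location.
client = Client(processes=False)
client.write_scheduler_file("scheduler.json")

# Inside each training function: reconnect to the same scheduler via the file.
worker_client = Client(scheduler_file="scheduler.json")
print(worker_client.scheduler.address == client.scheduler.address)  # True
```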
mdurant
  • Thanks for your answer. Are you saying that I shouldn't use joblib to create the two processes? – ps0604 Dec 28 '21 at 03:36
  • It would be fair to say, that I don't know what having two processes achieves. However, that's how the question was given, so that's how I have tried to answer, – mdurant Dec 28 '21 at 13:51
  • The intent of having two processes is to run two trainings in parallel. That's why I do `joblib.Parallel(n_jobs=2)` to trigger these two processes. Inside each process, I use `joblib.parallel_backend('dask')` to train one model. My question is whether using the same client variable in these two parallel processes creates a problem. – ps0604 Dec 28 '21 at 15:24
  • Added a paragraph to answer this specific thing – mdurant Dec 28 '21 at 15:51