
I've used Joblib and Airflow together in the past and haven't run into this issue. I'm trying to run a job through Airflow that performs a parallel computation using Joblib. When the Airflow job starts up, I see the following warning:

UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1

Tracing the warning back to its source, I see the following function in the joblib package's LokyBackend class triggering it (similar logic is also in the MultiprocessingBackend class):

def effective_n_jobs(self, n_jobs):
    """Determine the number of jobs which are going to run in parallel"""
    if n_jobs == 0:
        raise ValueError('n_jobs == 0 in Parallel has no meaning')
    elif mp is None or n_jobs is None:
        # multiprocessing is not available or disabled, fallback
        # to sequential mode
        return 1
    elif mp.current_process().daemon:
        # Daemonic processes cannot have children
        if n_jobs != 1:
            warnings.warn(
                'Loky-backed parallel loops cannot be called in a'
                ' multiprocessing, setting n_jobs=1',
                stacklevel=3)
        return 1
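
For reference, a quick way to confirm this is what's happening is to log the daemon flag from inside the task itself. A minimal sketch; log_daemon_status is just an illustrative name, not part of my DAG:

import multiprocessing as mp

def log_daemon_status():
    # Under Airflow's LocalExecutor, tasks run in forked worker processes
    # that are marked daemonic, which is exactly the condition that
    # effective_n_jobs() checks before forcing n_jobs=1
    proc = mp.current_process()
    print(f"process={proc.name}, daemon={proc.daemon}")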

The issue is that I've run a similar function with Joblib and Airflow before without triggering this condition that sets n_jobs to 1. I'm wondering whether this is some kind of versioning issue (I'm using Airflow 2.X and Joblib 1.X) or whether there are settings in Airflow that can fix this. I looked at older versions of Joblib and even downgraded to Joblib 0.4.0, but that didn't solve the issue. I'm more hesitant to downgrade Airflow because of differences in the API, database connections, etc.


Edit:

Here is the code I've been running in Airflow:

import joblib
from airflow import DAG
from airflow.operators.python import PythonOperator

# DEFAULT_ARGS is defined elsewhere in this file

def test_parallel():
    # 20 trivial increments, fanned out across all available cores
    out = joblib.Parallel(n_jobs=-1, backend="loky")(
        joblib.delayed(lambda a: a + 1)(i) for i in range(20)
    )

with DAG("test", default_args=DEFAULT_ARGS, schedule_interval="0 8 * * *") as test:
    run_test = PythonOperator(
        task_id="test",
        python_callable=test_parallel,
    )

    run_test

And the output in the Airflow logs:

[2021-07-27 10:41:29,890] {logging_mixin.py:104} WARNING - /data01/code/virtualenv/alpha/lib/python3.8/site-packages/joblib/parallel.py:733 UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1

I launch airflow scheduler and airflow webserver via supervisor. However, even if I launch both Airflow processes from the command line, the issue still persists. It doesn't happen, however, when I just run the task via the airflow tasks CLI, e.g. airflow tasks test run_test

  • I'm guessing here, based on the executor source code I have seen, that it might also be related to your executor. Some executors use multiprocessing to start the job. I can imagine that conflicting with this code. – Jorrick Sleijster Jul 09 '21 at 14:50
  • I do see in the old logs that I'm using LocalExecutor (and apparently that was able to parallelize) {__init__.py:51} INFO - Using executor LocalExecutor – Michael Jul 09 '21 at 14:55
  • Perhaps in the past joblib just defaulted to the Threading backend and didn't log anything? – Michael Jul 09 '21 at 15:15
  • Can you provide more context about how you define the job with joblib? As far as I see [here](https://github.com/joblib/joblib/blob/754433f617793bc950be40cfaa265a32aed11d7d/joblib/parallel.py#L46), now `loky` is the default backend in joblib. Can you try switching that to `multiprocessing` or `threading`? – bruno-uy Jul 15 '21 at 12:00
  • 1
    The issue is 100% related to `Airflow` since `joblib`'s warning is triggered by the main process being daemonic, which essentially means that you are running your task with *systemd* or maybe there's some *worker* configuration that runs the processes as services rather than in the foreground. Would be a lot more useful if you gave more info about how you are running the main task that actually calls `joblib.Parallel()`. – Max Shouman Jul 17 '21 at 02:14
  • Ah makes sense, I'm running the Airflow process via supervisor. Perhaps there is some configuration in supervisor that would fix the issue? – Michael Jul 19 '21 at 17:53
  • Turns out even if I run "airflow scheduler" on the command line the issue is still persistent. That would mean that the main process isn't daemonic right? – Michael Jul 20 '21 at 17:27
  • You must mention exact version of `airflow` and `joblib` – Nizam Mohamed Aug 03 '21 at 17:51
  • apache-airflow[postgres]==2.1.1 joblib==1.0.1 – Michael Aug 03 '21 at 17:53
  • You can simply try by setting `mp.current_process().daemon=True` in `test_parallel`. – Nizam Mohamed Aug 03 '21 at 18:00
  • Of course `import multiprocessing as mp` – Nizam Mohamed Aug 03 '21 at 18:00
  • I said setting that to `False` – Nizam Mohamed Aug 03 '21 at 18:07
  • How do you run airflow webserver and scheduler? full command line. – Nizam Mohamed Aug 03 '21 at 18:37
  • I'm running it via supervisor but if I run it via the command line same issue. Just "airflow webserver" and "airflow scheduler" in a virtualenv – Michael Aug 03 '21 at 21:16
  • Setting the `mp.current_process().daemon=True` solves the issue, however then joblib seems to not be able to manage the processes anymore and all of the parallel processes are left as zombies. Any idea how to clean them up without manually killing every time? – Michael Aug 27 '21 at 14:32
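
For context, the workaround discussed in the comments above amounts to clearing the daemon flag before the parallel call. A minimal sketch, reusing the test_parallel task from the question; note that mutating the flag like this is not supported by multiprocessing and, as reported above, can leave the loky workers behind as zombies:

import multiprocessing as mp

import joblib

def test_parallel():
    # Unsupported workaround: clear the daemon flag so that joblib's
    # LokyBackend.effective_n_jobs() no longer forces n_jobs=1
    mp.current_process().daemon = False
    out = joblib.Parallel(n_jobs=-1, backend="loky")(
        joblib.delayed(lambda a: a + 1)(i) for i in range(20)
    )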

2 Answers


I notice that you didn't call the run_test function at the bottom of your code. Could that be the cause of the issue? Corrected version:

def test_parallel():
    out = joblib.Parallel(n_jobs=-1, backend="loky")(
        joblib.delayed(lambda a: a + 1)(i) for i in range(20)
    )

with DAG("test", default_args=DEFAULT_ARGS, schedule_interval="0 8 * * *") as test:
    run_test = PythonOperator(
        task_id="test",
        python_callable=test_parallel,
    )

    run_test()
– Red

I solved this by switching from PythonOperator to BashOperator, and joblib stopped reducing the number of CPUs and threads to 1. I also followed the instructions from here to kill the daemonic processes after the code executes, but you can instead just wait 300 seconds, which is joblib's default timeout for terminating its worker processes.
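
A minimal sketch of that approach, with DEFAULT_ARGS as in the question; the /path/to/run_parallel.py path is hypothetical and stands for a standalone script holding the joblib code:

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("test", default_args=DEFAULT_ARGS, schedule_interval="0 8 * * *") as test:
    # BashOperator launches a fresh interpreter via a shell command, so the
    # joblib code no longer runs inside a daemonic multiprocessing child
    run_test = BashOperator(
        task_id="test",
        # hypothetical script containing the joblib.Parallel() call
        bash_command="python /path/to/run_parallel.py",
    )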