
In airflow.cfg there is a section called [operators], where default_cpus is set to 1 and default_ram and default_disk are both set to 512.
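For reference, this is roughly what the section looks like (a sketch of my config; depending on the Airflow version the section may contain other keys as well):

```ini
[operators]
default_cpus = 1
default_ram = 512
default_disk = 512
```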

I would like to understand whether or not I will get improvements in processing speed if I increase these parameters.

– V.Yan

1 Answer


I took a look at the sources: these settings are made available to all operators, but they are never actually used, neither by the operators themselves nor by any executor.

So I went a little bit back into the history and had a look at the commit that introduced those settings. Quoting the JIRA ticket that led to that PR, they are:

> optional resource requirements for use with resource managers such as yarn and mesos

The Mesos executor, however, is a community contribution that does not leverage these properties and just assigns the same amount of resources to every task, and the YARN executor does not exist yet as far as I know (as of version 1.9).
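For completeness, the place where such per-task values can be declared is the optional `resources` argument on `BaseOperator`. Here is a minimal sketch (Airflow 1.x-style imports, hypothetical DAG and task ids); keep in mind that, as said above, the executor may simply ignore these values:

```python
# Minimal sketch: declaring per-task resource requirements via the optional
# `resources` argument on an operator (Airflow 1.x-style imports assumed).
# Whether anything enforces these values depends on the executor in use.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="resources_example",    # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

heavy_task = BashOperator(
    task_id="heavy_task",          # hypothetical task id
    bash_command="echo 'pretend this is expensive'",
    # Keys correspond to the [operators] defaults discussed above.
    resources={"cpus": 2, "ram": 1024, "disk": 1024},
    dag=dag,
)
```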

I once had a discussion with the Airflow team to understand whether there was a way to assign resources on a per-task basis using the Mesos executor, and they replied with their strategy for assigning resources to tasks using the Celery executor, in case it helps you understand how to manage resources.

Regarding the core question you are asking in a more general sense: the throughput you can get out of a task in relation to the resources it is assigned depends a lot on the task itself. A very compute-intensive task that can leverage multiple processors will of course see a speed-up if you assign it more cores, while an I/O-intensive task (like copying data between different systems) will probably not see much improvement.
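As a toy illustration of that last point (plain Python, nothing Airflow-specific, numbers only indicative): a compute-bound job that can be split across worker processes scales with the number of cores it gets, while a job that mostly waits on external systems would not.

```python
# Toy illustration: compute-bound work split across worker processes
# speeds up with more cores; an I/O-bound task (e.g. copying data between
# systems) would instead be limited by the external systems, not by CPU.
import time
from multiprocessing import Pool


def burn_cpu(n):
    # Purely compute-bound work: sum of squares up to n.
    return sum(i * i for i in range(n))


def run(workers, chunks):
    start = time.time()
    with Pool(workers) as pool:
        pool.map(burn_cpu, chunks)
    return time.time() - start


if __name__ == "__main__":
    chunks = [2_000_000] * 8
    print("1 worker : %.2fs" % run(1, chunks))
    print("4 workers: %.2fs" % run(4, chunks))  # roughly 4x faster on a 4-core box
```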

– stefanobaghino

  • So can I assume that for `LocalExecutor`, the `default_cpus` setting is ignored? In that case the **max number of tasks running at any moment** (assuming only 1 `DAG` is running) should be entirely dependent on `dag_concurrency`. Actually my `scheduler` is not running more than 7-8 tasks at a time, so I thought maybe it had something to do with `cpus`. I'm using `Airflow` as a **pure scheduler** (all tasks run on *remote systems* over `SSH` or something similar), so I always thought that in this setup *the sky would be the limit* for how many concurrent tasks can be run. (It's a 4 vCore box, `r4.xlarge` on `EMR`.) – y2k-shubham Jan 30 '19 at 16:46
  • I would not assume that, especially considering that this reply is not very fresh. My suggestion is to go have a look at the code, if the documentation is not clear in this regard. – stefanobaghino Jan 30 '19 at 19:11