
I’m working on an ETL project using Azure Functions: I extract data from Blob Storage, transform it in Python with pandas, and load it using pandas’ to_sql(). I’m trying to make this process more efficient by using asyncio and language workers.

I’m a little confused because I was under the impression that asyncio runs everything on a single thread, but the Azure Functions documentation says you can use multiple language workers if you change your configuration, and that even a method that doesn’t use the async keyword runs in a thread pool.

Does that mean that if I don’t use the async keyword my methods will run concurrently using language workers? Do I have to use asyncio to utilize language workers?

Also, the documentation says that Azure Functions can scale to up to 200 instances. How can I scale to that many instances if I’m only allowed a maximum of 10 language workers?

Edit: Thanks Anatoli. Just to clarify, if I have a Timer Trigger with the following code:

```python
import azure.functions as func
from . import client_one_etl
from . import client_two_etl

def main(mytimer: func.TimerRequest) -> None:
    client_one_etl.main()
    client_two_etl.main()
```

If I have increased the number of language workers, does that mean client_one_etl.main() and client_two_etl.main() automatically run in separate threads, even without asyncio? And if client_two_etl.main() needs client_one_etl.main() to finish before executing, will I need async/await to prevent them from running concurrently?

And for the separate instances: if client_one_etl.main() and client_two_etl.main() don’t depend on each other, can I execute them in one Azure Function app as separate .py scripts that each run on their own VM? Is it possible to run multiple Timer Triggers (each calling its own __init__.py script on its own VM) within one Azure Function app? And would every script then need to complete within 10 minutes if I increase functionTimeout in host.json?

ddx

1 Answer


FUNCTIONS_WORKER_PROCESS_COUNT limits the maximum number of worker processes per Functions host instance. If you set it to 10, each host instance will be able to run up to 10 Python functions concurrently. Each worker process will still execute Python code on a single thread, but now you have up to 10 of them running concurrently. You don't need to use asyncio for this to happen. (Having said that, there are legitimate reasons to use asyncio to improve scalability and resource utilization, but you don't have to do that to take advantage of multiple Python worker processes.)
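To make this concrete, FUNCTIONS_WORKER_PROCESS_COUNT is an ordinary application setting. For local development it can be set in local.settings.json; the values below are purely illustrative (in Azure you would set the same name as an app setting in the portal or via the CLI):

```json
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "FUNCTIONS_WORKER_PROCESS_COUNT": "10"
  }
}
```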

The 200 limit applies to the number of Functions host instances per Function app. You can think of these instances as separate VMs. The FUNCTIONS_WORKER_PROCESS_COUNT limit applies to each of them individually, which brings the total number of concurrent (single-threaded) worker processes to 2,000.

UPDATE (answering the additional questions):

As soon as your function invocation starts on a certain worker, it will run on this worker until complete. Within this invocation, code execution will not be distributed to other worker processes or Functions host instances, and it will not be automatically parallelized for you in any other way. In your example, client_two_etl.main() will start after client_one_etl.main() exits, and it will start on the same worker process, so you will not observe any concurrency, regardless of the configured limits (unless you do something special in client_*_etl.main()).
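To illustrate, the kind of "something special" that would create concurrency inside a single invocation is code you write yourself, for example a thread pool. A minimal sketch, with stand-in functions in place of the real client_*_etl modules (worthwhile mainly when the work is I/O-bound, since Python threads share the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real client_one_etl.main() / client_two_etl.main(),
# assumed here to be independent of each other.
def client_one_etl():
    return "client one done"

def client_two_etl():
    return "client two done"

def main():
    # Run both ETL jobs on separate threads within a single invocation.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(client_one_etl), pool.submit(client_two_etl)]
        # result() blocks until each job finishes, so main() still
        # returns only after both are complete.
        return [f.result() for f in futures]
```

Note that this parallelizes work *inside* one worker process; it is unrelated to the FUNCTIONS_WORKER_PROCESS_COUNT limits discussed above.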

When multiple invocations happen around the same time, these invocations may be automatically distributed to multiple workers, and this is where the limits mentioned above apply. Each invocation will still run on exactly one worker, from start to finish. In your example, if you manage to invoke this function twice around the same time, each invocation can get its own worker and they can run concurrently, but each will execute both client_one_etl.main() and client_two_etl.main() sequentially.

Please also note that because you are using a timer trigger on a single function, you will not experience any concurrency at all: by design, a timer trigger will not start a new invocation until the previous invocation is complete. If you want concurrency, either use a different trigger type (for example, the timer function can put a message on a queue, and the queue-triggered function can then scale out to multiple workers automatically), or use multiple timer triggers with multiple functions, like you suggested.
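As a sketch of that timer-plus-queue pattern, the two functions could be wired together with bindings roughly like these (the function layout and queue name are hypothetical). The timer function's function.json declares a queue output binding:

```json
{
  "bindings": [
    { "name": "mytimer", "type": "timerTrigger", "direction": "in", "schedule": "0 0 * * * *" },
    { "name": "msg", "type": "queue", "direction": "out", "queueName": "etl-jobs", "connection": "AzureWebJobsStorage" }
  ]
}
```

and its Python code writes one message per ETL job (e.g. msg.set("client_one")). A second function is triggered by that queue, so each message becomes an independent invocation that the host can distribute across workers:

```json
{
  "bindings": [
    { "name": "msg", "type": "queueTrigger", "direction": "in", "queueName": "etl-jobs", "connection": "AzureWebJobsStorage" }
  ]
}
```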

If what you actually want is to run independent client_one_etl.main() and client_two_etl.main() concurrently, the most natural thing to do is to invoke them from different functions, each implemented in a separate __init__.py with its own trigger, within the same or different Function apps.

functionTimeout in host.json is applied per function invocation. So, if you have multiple functions in your app, each invocation should complete within the specified limit. This does not mean all of them together should complete within this limit (if I understood your question correctly).
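For reference, the timeout from the question would look like this in host.json (10 minutes is the maximum on the Consumption plan; other plans allow longer values):

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```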

UPDATE 2 (answering more questions):

@JohnT Please note that I'm not talking about the number of Function apps or __init__.py scripts. A function (described by __init__.py) is a program that defines what needs to be done. You can create far more than 10 functions per app, but don't do this to increase concurrency; it will not help. Instead, add functions to separate logically independent, coherent programs. A function invocation is a process that actively executes the program, and this is where the limits I'm talking about apply. You will need to be very clear on the difference between a function and a function invocation.

Now, in order to invoke a function, you need a worker process dedicated to this invocation until it completes. Next, in order to run a worker process, you need a machine that will host it. This is what the Functions host instance is (not a very accurate definition, but good enough for the purposes of this discussion). When running on the Consumption plan, your app can scale out to 200 Functions host instances, and each of them starts a single worker process by default (because FUNCTIONS_WORKER_PROCESS_COUNT = 1), so you can run up to 200 function invocations simultaneously. Increasing FUNCTIONS_WORKER_PROCESS_COUNT allows each Functions host instance to create more than one worker process, so each instance can handle up to FUNCTIONS_WORKER_PROCESS_COUNT function invocations, bringing the potential total to 2,000.

Please note though that "can scale out" does not necessarily mean "will scale out". For more details, see Azure Functions scale and hosting and Azure Functions limits.

Anatoli Beliaev
  • @JohnT Answered. – Anatoli Beliaev May 10 '20 at 05:27
  • Thank you, @Anatoli! Looks like I'll have to do some more research and experimenting, but I think figuring out how to get different ```__init__.py``` scripts to run within the same Function app seems like the best way to go. I really appreciate your help on this. – ddx May 10 '20 at 19:27
  • One last question, what do you mean by: "unless you do something special in ```client_*_etl.main()```" In the actual scripts, each ```client_*_etl.main()``` makes call to several ```transform()``` methods that do not depend on each other. So it makes sense to execute each ```client_*_etl.main()``` script as a separate invocation to achieve concurrency, but if I want to achieve concurrency in a single ```.py``` script how can I achieve this by doing something special? Do you mean using a multiprocessing Python library? – ddx May 10 '20 at 20:04
  • 1
    By "doing something special in `client_*_etl.main()`" I mean starting the work and exiting before the work is completed. This could be done by spawning additional threads or processes, or starting a job on an external service. In any case, you would have to write special code to make this happen. I'm definitely _not_ suggesting this is the right thing to do in your case, just mentioned this as a technical possibility. – Anatoli Beliaev May 10 '20 at 21:01
  • sorry @Anatoli, one more thing. For my timer trigger example, does increasing ```FUNCTIONS_WORKER_PROCESS_COUNT``` to 10 mean I can run 10 separate ```__init__.py``` scripts concurrently? Or can I run 200 ```__init__.py``` scripts as different Functions host instances? Still a little hazy on the difference between worker processes and host instances. – ddx May 10 '20 at 23:48
  • After re-reading your answer, it seems like I can have 10 ```__init__.py``` scripts per Azure Function and my organization as a whole is allowed 200 separate Azure Function apps? – ddx May 11 '20 at 00:35
  • 1
    @JohnT See UPDATE 2 – Anatoli Beliaev May 11 '20 at 22:42
  • In my simple example, can I think of distinct ```__init__.py``` scripts as functions, and the timer trigger is what invokes the function (potentially creating up to 200 host instances if the processes are separate)? – ddx May 12 '20 at 05:04
  • 1
    @JohnT Yes, you can. – Anatoli Beliaev May 12 '20 at 06:28