
I have an Azure Function with a Cosmos DB trigger that listens to a collection using the lease collection mechanism. The function is hosted on the Consumption plan.
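
For reference, the binding is configured roughly like this (a minimal sketch; the database, collection, and connection-string names below are placeholders):

    {
      "bindings": [
        {
          "type": "cosmosDBTrigger",
          "name": "documents",
          "direction": "in",
          "connectionStringSetting": "CosmosConnection",
          "databaseName": "mydb",
          "collectionName": "events",
          "leaseCollectionName": "leases",
          "createLeaseCollectionIfNotExists": true
        }
      ]
    }

And a minimal handler (Python programming model shown for illustration):

    import logging
    import azure.functions as func

    def main(documents: func.DocumentList) -> None:
        # Each invocation receives a batch of changed documents
        # from the monitored collection.
        logging.info("Processing %d changed documents", len(documents))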

I have noticed that under heavy load my function receives updates with a growing delay. After reading the documentation I did not find a way to improve the scaling of my setup. Is there one?

Alexander Capone

1 Answer


Consumption Plan instances should grow based on how far behind your Function is lagging, so on the Consumption Plan the scale-out happens for you. If you are using an App Service Plan, you can scale the instances yourself.

That being said, the current unit of work is based on the Partition Key value ranges. This means that, similar to Event Hub, the parallel processing has a soft limit based on your data distribution.
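
In other words (an illustration only, not an actual API), the effective parallelism is capped by the number of leases, which maps to the number of partition key value ranges:

    # Illustrative only: each partition key range maps to one lease,
    # and a lease is owned by exactly one instance at a time.
    def effective_parallelism(partition_key_ranges: int, instances: int) -> int:
        return min(partition_key_ranges, instances)

    # One partition key range means one active processor,
    # no matter how many instances are provisioned:
    assert effective_parallelism(1, 10) == 1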

One way to detect this is to check your leases collection. If you see only one lease (disregarding items with .info or .lock in their ids), your current data distribution yields a single partition key value range, and only one instance can be processing it (no matter how many other instances get provisioned).
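
For example, a quick check with the Python SDK (the endpoint, key, and names below are placeholders for this sketch):

    from azure.cosmos import CosmosClient

    # Placeholder endpoint, key, and names -- substitute your own.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    leases = client.get_database_client("mydb").get_container_client("leases")

    # Items whose id contains ".info" or ".lock" are bookkeeping
    # documents, not leases.
    lease_ids = [
        item["id"]
        for item in leases.query_items(
            "SELECT c.id FROM c", enable_cross_partition_query=True
        )
        if ".info" not in item["id"] and ".lock" not in item["id"]
    ]
    print(f"{len(lease_ids)} lease(s); each lease is processed by at most one instance")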

Logs can also show how scaling is behaving and how instances are picking up the different leases when there are multiple.

Matias Quaranta
  • Could you please give a bit more information or a link to read about this: "... the current unit of work is based on the Partition Key value ranges. This means that, similar to Event Hub, the parallel processing has a soft limit based on your data distribution." My partition key is a string value, and as you said, in the leases collection I can see only one document (excluding .info). How can I increase the number of parallel readers? The lag at some point gets big (15 minutes), which is not acceptable for the solution. A temporary solution is to migrate to an App Service Plan, but it's undesirable. – Alexander Capone May 23 '19 at 21:17
  • 1
    There are a couple of points worth addressing first: How long does your Function take to process 1 batch? Regardless of the scaling, this is possibly the first point to address (https://learn.microsoft.com/en-us/azure/azure-functions/functions-best-practices#avoid-long-running-functions). As for the amount of partition key values, is your collection partitioned? if so, how many RUs are provisioned and what size of data are you managing? – Matias Quaranta May 23 '19 at 21:32
  • The function takes ~150 ms to execute. It is a simple "get from Cosmos, enrich, put to topic" kind of thing. I have 4000 RUs. During the load test I had 10K partitions. In my design each aggregate root is a partition (as you can imagine, I am playing with an event sourcing solution). – Alexander Capone May 23 '19 at 21:35
  • 1
    When you mention 10K partitions, do you mean partition key values? What's the rough amount of data? If the function is taking only 150ms per execution, is that taking into account that a Trigger batch can contain 100 documents? If at some point you are lagging behind 15 minutes, that means you are writing on the collection at more than 1 document / 1.5 ms (faster than what the Function can process) or that the Function takes longer on longer batches to send to the Topic (Event Grid?)? – Matias Quaranta May 23 '19 at 21:44
  • Yes, 10K unique partition key values. Each document contains about 10 JSON fields, e.g. Name, Time, etc. Some generic business logic data. Unfortunately I don't remember the exact execution time right now, but I remember from Application Insights that the function's processing takes 130 ms on average. 1 doc / 1.5 ms ≈ 667 docs/sec? Hm, that is definitely not the load. I will check tomorrow and come back to you. I am writing to a Service Bus Topic. The actual load is about 150 documents/second spread EVENLY across partition keys. Changing to an App Service Plan with 4 instances made the lag < 1 sec. – Alexander Capone May 23 '19 at 21:56
  • You know what, I just realized: in Application Insights, even on the Service Plan, I saw only one instance handling all the events, as you said in your original answer: one lease, one listener. As you write on your Medium page, the issue can be in infrastructure capacity. My Service Plan is S3, and in processing the events I do some LINQ, so maybe the CPU jumps... I will test that tomorrow. FINALLY, I FEEL LIKE I AM MOVING IN THE CORRECT DIRECTION AND THERE'S NO MAGIC AGAIN. Thank you SO MUCH. – Alexander Capone May 23 '19 at 22:11
  • Consumption Plan instances are similar to S1 instances, so it could be resource exhaustion if it's consuming too much CPU/memory. I would investigate whether there is a memory leak that causes more resources to be consumed over time, slowing it down after it has run fine for a while, or whether there are CPU-heavy operations. Glad we are pinpointing the issue :) – Matias Quaranta May 23 '19 at 22:52