
I'm seeing multi-second pauses in the event stream, even reading from the retention pool.

Here's the main nugget of EH setup:

    // Blob container used as the checkpoint store, plus the processor itself.
    BlobContainerClient storageClient = new BlobContainerClient(blobcon, BLOB_NAME);
    RTMTest.eventProcessor = new EventProcessorClient(storageClient, consumerGroup, ehubcon, EVENTHUB_NAME);

And then the do-nothing processor:

    // "Do-nothing" handler: just count the event and checkpoint every 16th one.
    static async Task processEventHandler(ProcessEventArgs eventArgs)
    {
        RTMTest.eventsPerSecond++;
        RTMTest.eventCount++;

        // Checkpoint periodically so a restart doesn't replay the whole stream.
        if ((RTMTest.eventCount % 16) == 0)
        {
            await eventArgs.UpdateCheckpointAsync(eventArgs.CancellationToken);
        }
    }
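
For reference, here's roughly how the processor gets wired up and started. The error-handler name below is a placeholder for the one I actually register (not shown here):

    // Hook up the handlers and start pumping events.
    // processErrorHandler is a stand-in name for my actual error handler.
    RTMTest.eventProcessor.ProcessEventAsync += processEventHandler;
    RTMTest.eventProcessor.ProcessErrorAsync += processErrorHandler;

    await RTMTest.eventProcessor.StartProcessingAsync();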

And then a typical execution:

15:02:23: no events
15:02:24: no events
15:02:25: reqs=643 
15:02:26: reqs=656 
15:02:27: reqs=1280 
15:02:28: reqs=2221 
15:02:29: no events
15:02:30: no events
15:02:31: no events
15:02:32: no events
15:02:33: no events
15:02:34: no events
15:02:35: no events
15:02:36: no events
15:02:37: no events
15:02:38: no events
15:02:39: no events
15:02:40: no events
15:02:41: no events
15:02:42: no events
15:02:43: no events
15:02:44: reqs=3027 
15:02:45: reqs=3440 
15:02:47: reqs=4320 
15:02:48: reqs=9232 
15:02:49: reqs=4064 
15:02:50: reqs=395 
15:02:51: no events
15:02:52: no events
15:02:53: no events

The event hub, blob storage, and the RTMTest webjob are all in US West 2. The event hub has 16 partitions. It's correctly calling my handler, as evidenced by the bursts of data. The error handler is not called.

Here are two applications side by side: the left using Redis, the right using Event Hub. The events drive the animations, so you can visually watch the long stalls. Note: these are vaccines being reported around the US, either live or via batch reconciliations from the pharmacies.

[vaccine reporting animations]

Any idea why I see the multi-second stalls?

Thanks.

Dennis Cronin
  • Some things that would help better understand what you're seeing: How many processors are working against the Event Hub? How many cores does the host machine have? How are events being published (constant stream, bursty)? Do the partitions have a backlog of events waiting or are they being read as they arrive? How large is the average payload for events? – Jesse Squire Dec 02 '22 at 19:30
  • The RTMTest thing above was being run as a web job in an app service: immslink-usw2-prod-asp01 (P1v2: 1) – Dennis Cronin Dec 02 '22 at 20:56
  • Hit return and it ended my comment. Here's more. The stream tends to be pretty steady, as you can see from the side-by-side above with Redis, and even when I run against the 7-day persisted event pool it's bursty. I'm reading things pretty much as they arrive; if it falls behind, the UI shows "catch up" or some such. Payloads range from about 1k to, yikes, around 62k. That's too close to the 64k limit for comfort. I wonder if that's what leads to the stall. – Dennis Cronin Dec 02 '22 at 21:05
  • Thanks for the clarifications; that's helpful. Shifting down to an answer for some thoughts. – Jesse Squire Dec 03 '22 at 14:20
  • It would still be helpful to understand how many partitions a single processor is responsible for and how many CPU cores each host has. – Jesse Squire Dec 03 '22 at 14:45
  • @jesse - I just have the one event handler, which handles all the partitions. I've seen some other EH reader setups that have partition-specific readers. As for CPU, it's an Azure app service running on a P1v2 (1 instance), which is: 210 total ACU, 3.5 GB memory, Dv2-Series compute equivalent. – Dennis Cronin Dec 03 '22 at 14:56
  • And the left-hand side, which is "prod" using Redis, runs on lower power: S1, 100 total ACU, 1.75 GB memory, A-Series compute equivalent. – Dennis Cronin Dec 03 '22 at 14:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250121/discussion-between-jesse-squire-and-dennis-cronin). – Jesse Squire Dec 03 '22 at 20:44

2 Answers


Event Hubs consumers make use of a prefetch queue when reading. This is essentially a local cache of events that the consumer tries to keep full by continually streaming from the service. To prioritize throughput and avoid waiting on the network, consumers read exclusively from the prefetch queue.

The pattern that you're describing falls into the "many smaller events" category, which will often drain the prefetch quickly if event processing is also quick. If your application is reading more quickly than the prefetch can refill, reads will start to take longer and return fewer events as they wait on network operations.

One thing that may help is to test using higher values for PrefetchCount and CacheEventCount in the options when creating your processor. These default to a prefetch of 300 and a cache event count of 100. You may want to try testing with something like 750/250 and see what happens. We recommend keeping at least a 3:1 ratio between PrefetchCount and CacheEventCount.
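
As a rough sketch (reusing the names from your snippet, and treating 750/250 purely as a starting point to experiment with), the options would be passed when constructing the processor:

    // Larger prefetch and event cache, keeping roughly a 3:1 ratio.
    var processorOptions = new EventProcessorClientOptions
    {
        PrefetchCount = 750,
        CacheEventCount = 250
    };

    var processor = new EventProcessorClient(
        storageClient,
        consumerGroup,
        ehubcon,
        EVENTHUB_NAME,
        processorOptions);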

It is also possible that your processor is being asked to do more work than is recommended for consistent performance across all partitions it owns. There's good discussion of different behaviors in the Troubleshooting Guide, and ultimately, capturing a +/- 5-minute slice of the SDK logs described here would give us the best view of what's going on. That's more detail and requires more back-and-forth discussion than works well on StackOverflow; I'd invite you to open an issue in the Azure SDK repository if you go down that path.

Something to keep in mind is that Event Hubs is optimized to maximize overall throughput and not for minimizing latency for individual events. The service offers no SLA for the time between when an event is received by the service and when it becomes available to be read from a partition.

When the service receives an event, it acknowledges receipt to the publisher and the send call completes. At this point, the event still needs to be committed to a partition. Until that process is complete, it isn't available to be read. Normally, this takes milliseconds but may occasionally take longer for the Standard tier because it is a shared instance. Transient failures, such as a partition node being rebooted/migrated, can also impact this.

With your near-real-time reading, you may be processing quickly enough that there's nothing client-side that will help. In that case, you'd need to consider adding more throughput units (TUs), moving to a Premium or Dedicated tier, or using more partitions to increase concurrency.

Update:

For those interested without access to the chat, log analysis shows a pattern of errors that indicates that either the host owns too many partitions and load balancing is unhealthy or there is a rogue processor running in the same consumer group but not using the same storage container.

In either case, partition ownership is bouncing frequently, causing partition processing to stop, move to a new host, reinitialize, and restart, only to stop and have to move again.

I've suggested reading through the Troubleshooting Guide, as this scenario and some of the other symptoms are discussed there in detail.

I've also suggested reading through the samples for the processor - particularly Event Processor Configuration and Event Processor Handlers. Each has guidance around processor use and configuration that should be followed to maximize throughput.
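
To illustrate one point from the handler guidance: exceptions that escape your handlers aren't observed by the processor, so each handler should guard its own code and return quickly. A minimal sketch (not your actual handler):

    // Sketch only: log the error and swallow exceptions, since the processor
    // has nowhere to surface anything thrown from a handler.
    static Task processErrorHandler(ProcessErrorEventArgs args)
    {
        try
        {
            Console.WriteLine(
                $"Error in partition '{args.PartitionId}' during " +
                $"operation '{args.Operation}': {args.Exception.Message}");
        }
        catch
        {
            // Intentionally ignored; there's no safe place to rethrow.
        }

        return Task.CompletedTask;
    }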

Jesse Squire
  • Thanks for all the help. I used your recommendations for prefetch/cache with no discernible difference. The stalls all tend to be in the 15-second range, and the new prefetch/cache values are deployed in the same place: http://vaxbytes.com/dual.html I'm looking into capturing logs now. Thanks! – Dennis Cronin Dec 03 '22 at 15:52
  • Here's the log. It appears to show a lot of init stuff, but nothing from the actual run. My test program's output is prefixed with RTMTest. I'm now printing the number of events coming from each partition. Nothing suspicious is jumping out at me. Btw, I'm ex-MS too. https://denniscronin.net/tmp/log.txt – Dennis Cronin Dec 03 '22 at 17:33
  • Duh. I was letting it go out of scope. Here's the full log: denniscronin.net/tmp/log2.txt – Dennis Cronin Dec 03 '22 at 17:46
  • And here's another. I do see the string "exception" and am studying it to see if I can figure out what's happening. "no events" is the string to look for to see that we had a stall. https://denniscronin.net/tmp/log3.txt – Dennis Cronin Dec 03 '22 at 18:08
  • And still one more. For this one I deleted the blob checkpoint to force arg.DefaultStartingPosition = EventPosition.Latest in the partition-initializing method, because I saw those exception messages about an out-of-order epoch. https://denniscronin.net/tmp/log4.txt – Dennis Cronin Dec 03 '22 at 18:18
  • We're outside the bounds of Stack Overflow for discussion; I've moved analysis of the logs and thoughts into chat. – Jesse Squire Dec 03 '22 at 21:17

@jesse very patiently examined my logs and led me to the "duh" moment of realizing that I just needed a separate consumer group for this second application reading the Event Hub data. Now things are rock solid. Thanks, Jesse!
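
In case it helps anyone else, the change amounted to creating a dedicated consumer group for this second reader and pointing the processor at it (the group name below is just a placeholder):

    // Each independent reader application gets its own consumer group so the
    // two apps stop stealing partition ownership from each other.
    // "rtmtest" is a placeholder for the new consumer group's name.
    const string consumerGroup = "rtmtest";

    RTMTest.eventProcessor = new EventProcessorClient(
        storageClient, consumerGroup, ehubcon, EVENTHUB_NAME);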

Dennis Cronin