4

How do I determine the cause of an EventGrid message delivery failure?

Tried so far:

  • Set a Diagnostic Setting on the EventGrid System Topic
  • Open Log Analytics and query the AegDeliveryFailureLogs table with:
//==================================================//
// Assign variables
//==================================================//
let varStart = ago(5h);
let varEnd = now();
//==================================================//
// Query table
//==================================================//
AegDeliveryFailureLogs
| where TimeGenerated between (varStart .. varEnd)
| project ParsedMessage = split(Message, ',')
  • Results in 1700+ records in my case
  • Sample result:
[
"outcome=Aborted",
" latencyInMs=180004",
" id=<redacted>",
" outputEventSystemId=<redacted>",
" state=Filtered",
" deliveryTime=5/26/2021 3:22:01 PM",
" deliveryCount=3",
" probationCount=1",
" deliverySchema=EventGridEvent",
" eventSubscriptionDeliverySchema=EventGridEvent",
" outputEventFields=InputEvent| EventSubscriptionId| DeliveryTime| DeliveryCount| State| Id| ProbationCount| LastDeliveryOutcome| DeliverySchema| LastDeliveryAttemptTime| SystemId| UseMappedResourceArmIdForBilling",
" outputEventFieldCount=12",
" requestExpiration=5/26/2021 3:23:01 PM",
" delivered=False id=<redacted>",
" inputEventSystemId=<redacted> publishTime=5/26/2021 3:14:42 PM",
" eventTime=5/26/2021 3:14:42 PM",
" eventType=Microsoft.Storage.BlobCreated",
" deliveryTime=1/1/0001 12:00:00 AM",
" filteringState=FilteringPending",
" inputSchema=EventGridEvent",
" publisher=MICROSOFT-STORAGE-STORAGEACCOUNTS.<redacted>.EVENTGRID.AZURE.NET",
" size=1162",
" subject=/blobServices/default/containers/<redacted>/blobs/2021-05-25/<redacted>/<redacted>.csv",
" inputEventFields=Id| PublishTime| SerializedBody| EventType| Topic| Subject| FilteringHashCode| SystemId| Publisher| FilteringTopic| TopicCategory| DataVersion| MetadataVersion| InputSchema| EventTime| FilteringPolicy",
" inputEventFieldCount=16",
" type=AzureFunction",
" subType=NotApplicable",
" supportsBatching=False",
" aadIntegration=False",
" useSystemIdentity=False",
"armId=/subscriptions/<redacted>/resourceGroups/<redacted>/providers/Microsoft.Web/sites/<redacted>/functions/<redacted>",
" deliveryResponse=",
" httpStatusCode=InternalServerError",
" errorType=UnexpectedError",
" errorMessage=An unexpected error has occurred. Please report the x-ms-request-id header value to our forums for assistance or raise a support ticket.",
" errorMessage=The operation timed out after: 00:03:00"
]   

There is a clear outcome=Aborted, along with httpStatusCode/errorType/errorMessage values that indicate an issue, but this is where the investigation stops.
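
To make the 1700+ records easier to triage, the individual key=value pairs can be pulled out of Message and grouped. A rough sketch against the same table (the regex-based extraction and the resulting column names are my own, not anything documented):

//==================================================//
// Group the delivery failures by cause
//==================================================//
AegDeliveryFailureLogs
| where TimeGenerated between (ago(5h) .. now())
// Each Message is a comma-separated list of key=value pairs
| extend Outcome        = extract(@"outcome=([^,]+)", 1, Message),
         HttpStatusCode = extract(@"httpStatusCode=([^,]+)", 1, Message),
         ErrorType      = extract(@"errorType=([^,]+)", 1, Message)
| summarize Failures = count() by Outcome, HttpStatusCode, ErrorType
| order by Failures desc

This shows which outcome/httpStatusCode/errorType combinations dominate, but it still doesn't explain why they occur.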

What is causing these EventGrid message issues and how do I determine cause?

ericOnline
  • It looks like the AEG didn't receive a message delivery status from the delivery endpoint (Azure Function) within the 3 minutes. – Roman Kiss May 27 '21 at 05:56
  • Could you please describe 1.) how you discerned this from the above message and 2.) for what reasons this might occur 1700x in a 5hr period in a normally working Azure Function? – ericOnline May 27 '21 at 06:00
  • I do recommend reading https://learn.microsoft.com/en-us/azure/event-grid/delivery-and-retry#failure-codes – Roman Kiss May 27 '21 at 06:21
  • Oh I've read all the docs. The `httpStatusCode` of `InternalServerError` is not in that doc. – ericOnline May 27 '21 at 06:25
  • Basically what are some possible reasons, that I can validate in a log, for **why** an EventGrid message would fail to deliver to an Azure Function (a P2V2 premium Function Plan at that)? – ericOnline May 27 '21 at 06:35
  • 2
    Your sample shows that delivery of the event message was attempted 3 times and, based on the retry policy, the delivery process was aborted. When dead-lettering is enabled, the message is stored (after 5 minutes) in the dead-letter storage with details of the delivery failure reason. Your subscriber is running on the Azure Functions Premium plan, so have a look at how it is scaling out, etc. The other option is to deliver the event message to a queue event handler resource and then pull it out with an Azure Function trigger. – Roman Kiss May 27 '21 at 07:48
  • Thanks for the details. I'm assuming you're referring to `deliveryCount=3`. Can you point me to a Microsoft doc where the keys/values in the above message are explained? Things like `state=Filtered` would lead me to believe that this message was actually filtered by the AEG Advanced Filters. There are duplicate `deliveryTime` and `errorMessage` keys, and there also appear to be missing delimiters (commas) in the message (around `publishTime` and `id`), making parsing more difficult. – ericOnline May 27 '21 at 16:59

2 Answers

1

It appears the only way to determine why an Event Grid message failed to be delivered is to enable dead-lettering on the event subscription.

Here is a sample Dead Letter message.

  • It clearly shows the deadLetterReason and some other valuable debugging information:
[
    {
        "id": "redacted",
        "eventTime": "2021-06-07T22:52:42.7156553Z",
        "eventType": "Microsoft.Storage.BlobCreated",
        "dataVersion": "",
        "metadataVersion": "1",
        "topic": "/subscriptions/redacted/resourceGroups/redacted/providers/Microsoft.Storage/storageAccounts/redacted",
        "subject": "/blobServices/default/containers/redacted/blobs/20210525/redacted.csv",
        "deadLetterReason": "TimeToLiveExceeded",
        "deliveryAttempts": 12,
        "lastDeliveryOutcome": "TimedOut",
        "publishTime": "2021-06-07T22:52:43.1497044Z",
        "lastDeliveryAttemptTime": "2021-06-08T21:50:11.3677712Z",
        "data": {
            "api": "PutBlockList",
            "clientRequestId": "redacted",
            "requestId": "redacted",
            "eTag": "0x8D92A06F5883BA6",
            "contentType": "application/octet-stream",
            "contentLength": 4767672448,
            "blobType": "BlockBlob",
            "blobUrl": "https://redacted.blob.core.windows.net/redacted/20210525/redacted.csv",
            "url": "https://redacted.blob.core.windows.net/redacted/20210525/redacted.csv",
            "sequencer": "000000000000000000000000000093100000000000001286",
            "storageDiagnostics": {
                "batchId": "redacted"
            }
        }
    }
]
ericOnline
  • I enabled dead-lettering and I still don't get anything in the blob container. It is also not showing anything in Logs. It does, however, register the Dropped event under the topic's overview graph. But it is impossible for me to see the detailed log anywhere. – Oliver Nilsen Sep 15 '22 at 07:37
  • 1
    They did appear, but it took 10 minutes before the JSON file was created in the blob container. I've even set the subscription's retry policy to Max Event Delivery Attempts: 3 and Event Time to Live: 1 min, in order to see the failed deliveries a bit faster. In general, I think failed-delivery error handling and logging for Event Grid topics is a bit cumbersome. – Oliver Nilsen Sep 15 '22 at 07:52
0

You can also try setting up diagnostic settings on the Event Grid topic and mapping them to a Log Analytics workspace. You can then find the error message and reason in the logs in the Log Analytics workspace.
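
For example, once delivery failures are flowing into the workspace, something along these lines lists the per-event error details. This is only a sketch: it assumes the resource-specific AegDeliveryFailureLogs table and the comma-separated Message format shown in the question, and the extracted column names are arbitrary:

AegDeliveryFailureLogs
| where TimeGenerated > ago(24h)
// Pull the fields of interest out of the raw Message string;
// errorMessage appears more than once in the sample, extract() returns the first match
| extend HttpStatusCode = extract(@"httpStatusCode=([^,]+)", 1, Message),
         ErrorMessage   = extract(@"errorMessage=([^,]+)", 1, Message),
         Subject        = extract(@"subject=([^,]+)", 1, Message)
| project TimeGenerated, HttpStatusCode, ErrorMessage, Subject
| order by TimeGenerated desc

This complements the dead-letter approach in the other answer: the workspace logs show the individual failed delivery attempts, while the dead-letter blob records the final reason the event was dropped.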

Prasanna