I manage two on-premesis Azure DevOps Servers both running version 2019.Update1.1 (17.153.29522.3).
One of these two servers stops sending "Work Item" notifications on a regular basis. When this issue occurs, the notifications do actually exist, and are held by the server until a reboot of the application tier is done. After the reboot of that machine, hundreds of backlogged work item notifications are sent out at once in a large "flurry" (all at once).
I have monitored the issue over the course of 8 weeks. During that time, the issue has occurred 6 times. All 6 times it has started on a Thursday evening. I do not know what initiates the issue. However, if it happens, it always happens on a Thursday evening. (Just not every Thursday evening.)
The issue occurred last night 7/23/2020.
On the OI Job Monitoring page this morning, after the reboot I see that the "WorkItem User Delivery" Job failed with a result of "Inactive" and a run time of 14 hours and 9 minutes.
History Id Priority Result Host Name Job Name Queue Time Time in Queue Run Time
9227035 10 Inactive [TEAM FOUNDATION]DefaultCollection WorkItem User Delivery 7/23/2020 2:30 PM 0:00:00.28 14:09:26.557
Details are:
Job History Details
[9227035] [TEAM FOUNDATION]DefaultCollection:WorkItem User Delivery
Result: Inactive
Priority: 10
Queued Reasons: Queued
Host: [TEAM FOUNDATION]DefaultCollection
Job Name: WorkItem User Delivery
Agent: Process does not exist anymore
Queue Time: Thursday, July 23, 2020 2:30 PM
Queue Duration: 0:00:00.28
Start Time: Thursday, July 23, 2020 2:30 PM
Run Duration: 14:09:26.557
End Time: Friday, July 24, 2020 4:39 AM
Result Message:
Host Id: 33d3f91e-acc9-41b9-8744-fc2ed904235a
Job Id: 631f49b3-46e1-42ec-8fff-081bd176c18a
Agent Id: 3d718e2d-8c24-4297-9088-9e58c2ed9a6e
I would like to ask for help in diagnosing this issue as well as any recommendations on how to resolve it.
Note: I am reviewing the old case of a similar issue mentioned here: Tfs 2017 Update 2 - "WorkItem user delivery" job stuck - no email/soap notifications
I'd also like to ask if any of the advice in that case is still relevant to Azure DevOps Server 2019.Update1.1. His resolution message does not seem to apply, however.
EDIT as of 7/31/2020 Follows....
True to the pattern, the WorkItem User Delivery job hung again on Thursday 7/30/2020 at 4:27:32 Pacific Time.
This time I was monitoring the TfsJobAgent.exe via procmon for the previous several hours, filtering on ":587".
After the interation at 4:27, there was no further attempt from that process to initate traffic to the email server.
The job failure message that was generated at that time:
E-Mail job summary: 1 errors, 0 warnings.The detailed log for this job is available here: https://.com/DefaultCollection/_apis/notification/DiagnosticLogs/a96d6177-beef-477a-a2ee-2c31433214d0/entries/52195e2c-6e55-4db6-aea7-e9d16279da80Previous job run succeeded.Current job run succeeded - consecutive successes: 3843Properties and stats:timings:NotificationSql=6 PreDelivery=0 Filter=0 Identity=0 NotificationDeliveryTimeAverage=30014 NotificationDeliveryTimeAttempts=1 Summary: read=1 attempted=1 succeeded=0 failed=1 failedAlertable=0 failedNonAlertable=0 failedRetryable=1 recipientCount=1 failedWithFailed=1 DeploymentCompletedEvent: read=1 attempted=1 succeeded=0 failed=1 failedAlertable=0 failedNonAlertable=0 failedRetryable=1 recipientCount=1 Smtp:UnassignedBatch 1 First 423866 Started 08:11:38.4594602 Finished 08:12:08.6917156:Notification 423866 for DeploymentCompletedEvent result Failed!TF271000: An error occurred while attempting to send an email notification to the following address: redacted.user@co.com. Verify that the email notification settings are correct in the Team Foundation Administration Console.SmtpException:Failure sending mail.@ExceptionDispatchInfo.Throw:0@TaskAwaiter.HandleNonSuccessAndDebuggerNotification:0@d__2.MoveNext:0@ExceptionDispatchInfo.Throw:0@TaskAwaiter.HandleNonSuccessAndDebuggerNotification:0@d__6.MoveNext:0+Inner:WebException:The operation has timed out.08:12:08.6917156:ID:423866 - NextProcessTime=7/31/2020 9:12:08 AM Prior Attempts=1008:12:08.9573417:Previous job run succeeded.08:12:08.9573417:Current job run succeeded - consecutive successes: 3843
The Diagnostic Log at the URL above returned:
Internal log text:
{"count":1,"value":[{"batches":[{"startTime":"08:11:38.4594602","endTime":"08:12:08.6917156","notificationIds":"423866","notificationCount":1,"problematicNotifications":[{"id":423866,"result":"Failed","eventId":1293755,"eventType":"DeploymentCompletedEvent","subscriptionId":"477","stats":{},"recipients":{"7833f10d-690f-45f4-8ed8-40e012a86868":{"recipient":{"id":"7833f10d-690f-45f4-8ed8-40e012a86868","displayName":""},"status":null}},"messages":[{"level":1,"time":"08:12:08.6917156","message":"TF271000: An error occurred while attempting to send an email notification to the following address: redacted.co.com. Verify that the email notification settings are correct in the Team Foundation Administration Console.\r\nSmtpException:Failure sending mail.\r\n@ExceptionDispatchInfo.Throw:0\r\n@TaskAwaiter.HandleNonSuccessAndDebuggerNotification:0\r\n@d__2.MoveNext:0\r\n@ExceptionDispatchInfo.Throw:0\r\n@TaskAwaiter.HandleNonSuccessAndDebuggerNotification:0\r\n@d__6.MoveNext:0\r\n+Inner:\r\nWebException:The operation has timed out.\r\n\r\n"}]}]}],"result":"PartiallySucceeded","stats":{"timings":{"NotificationSql":6,"PreDelivery":0,"Filter":0,"Identity":0,"NotificationDeliveryTimeAverage":30014,"NotificationDeliveryTimeAttempts":1},"All":{"read":1,"attempted":1,"succeeded":0,"failed":1,"failedAlertable":0,"failedNonAlertable":0,"failedRetryable":1,"recipientCount":1,"Failed":1},"DeploymentCompletedEvent":{"read":1,"attempted":1,"succeeded":0,"failed":1,"failedAlertable":0,"failedNonAlertable":0,"failedRetryable":1,"recipientCount":1,"Failed":1},"EmailCounters":{"emailProcessed":1,"emailStarted":1,"emailFailed":1,"emailAttemptedMessages":1,"emailAttemptedRecipients":1},"EmailTimings":{"AllSendMailAverage":30002,"AllSendMailAttempts":1,"SucceededSendMailAverage":0,"SucceededSendMailAttempts":0,"FailedSendMailAverage":30002,"FailedSendMailAttempts":1,"Subscriber":0,"TemplateFetch":0,"Transform":12}},"logType":"NotificationDelivery","source":"a96d6177-beef-477a-a2ee-2c31433214d0","id":"52195e2c-6e55-4db6-aea7-e9d16279da80","startTime":"2020-07-31T08:11:38.4594602Z","endTime":"2020-07-31T08:12:09.0511168Z","description":"Notifications E-Mail Delivery","warnings":0,"errors":1,"activityId":"6bddc09f-e357-413c-9e3f-66092a38968d","properties":{"JobTimeSlice":"300000","JobTimeoutMultiple":"12","JobRestartDelay":"30","ProcessingJobDelaySeconds":"300","MaxConcurrentInstances":"5","InstancesExceededRescheduleDelaySeconds":"30000","RerunNotificationJobDelaySecs":"30000","ContributionFallbackModeRescheduleDelaySeconds":"60000","OrganizationAadTenantId":"00000000-0000-0000-0000-000000000000","EventingEnabled":"True","FailedRetryInterval":"300","NotificationBatchSize":"100","NotificationStatFlushFrequency":"5","CoreDeliveryStartDelay":"10","EmailEnabled":"True","MaxMailSenders":"8","MaxMailRecipients":"90","AsciiOnlyEmailAddresses":"False","Feature: NotificationJob.DisablePlainTextEmail":"True","SmtpServer":"smtp.office365.com","SmtpPort":"587","HasSmtpUser":"true","HasSmtpPassword":"true","HostName":"DefaultCollection","SendingDomain":"redacted.co.com","NextProcessTime":"7/31/2020 8:15:38 AM"},"messages":[{"level":3,"time":"08:12:08.6917156","message":"ID:423866
- NextProcessTime=7/31/2020 9:12:08 AM Prior Attempts=10"},{"level":3,"time":"08:12:08.9573417","message":"Previous job run succeeded."},{"level":3,"time":"08:12:08.9573417","message":"Current job run succeeded - consecutive successes: 3843"}]}]}
The server remained this way for 10 hours during the investigation. During this time, the email server could be pinged. The admin console could successfully send email alert test messages.
After the 10 hour period, only the TFSJobAgent service was restarted. Once that was restarted a flurry of email messages were successfully sent. Procmon was able to record the traffic that occurred after the TfsJobAgent restart.
Some tentative conclusions so far:
1. The issue cannot be a App Tier machine issue nor an email server issue:
a) Tfs Administration Console can send messages during a "hang".
b) Only a restart of the TfsJobAgent is required to reset the "hang".
2. The TfsJobAgent process does not even try to contact the email server during the "hang".
a) Outside of the "hang" window, Procmon can record its attempts.
b) During the "hang" window, Procmon sees no attempt to contact the email server.
3. The TfsJobAgent code believes it has made the attempt but never actually made it.
a) It records results to contact as "operation timed out" which procmon never saw.
Tentative hypothesis: Either there is a bug in the TfsJobAgent....or something that it processed....which compromised that processes ability to send/receive network traffic (at most) or traffic to the email server (at least).