0

A timer triggered Azure Webjob running at "B2 Basic" pricing tier is aborted unexpectedly.

    [01/09/2018 17:16:03 > 183d71: SYS INFO] Status changed to Running
    [01/09/2018 17:16:04 > 183d71: INFO] *** WebJob Main: start ***
    [01/09/2018 17:16:06 > 183d71: INFO] Found the following functions:
    [01/09/2018 17:16:06 > 183d71: INFO] BB.WebJobs.Functions.ProcessQueueMessage
    [01/09/2018 17:16:06 > 183d71: INFO] BB.WebJobs.Functions.DailyWorker
    [01/09/2018 17:16:07 > 183d71: INFO] The next 5 occurrences of the schedule will be:
    [01/09/2018 17:16:07 > 183d71: INFO] 1/10/2018 12:15:00 AM
    [01/09/2018 17:16:07 > 183d71: INFO] 1/11/2018 12:15:00 AM
    [01/09/2018 17:16:07 > 183d71: INFO] 1/12/2018 12:15:00 AM
    [01/09/2018 17:16:07 > 183d71: INFO] 1/13/2018 12:15:00 AM
    [01/09/2018 17:16:07 > 183d71: INFO] 1/14/2018 12:15:00 AM
    [01/09/2018 17:16:07 > 183d71: INFO] Job host started
    [01/09/2018 18:15:57 > 183d71: SYS INFO] WebJob is still running
    [01/10/2018 00:15:00 > 183d71: INFO] Executing 'Functions.DailyWorker' (Reason='Timer fired at 2018-01-10T00:15:00.2773840+00:00', Id=85f5338f-f568-4bec-8eaf-e51fa63dffc0)
    [01/10/2018 00:15:00 > 183d71: INFO] DailyWorker started
    [01/10/2018 03:05:26 > 183d71: SYS INFO] Status changed to Starting
    [01/10/2018 03:05:26 > 183d71: SYS INFO] WebJob singleton setting is False
    [01/10/2018 03:05:32 > 183d71: SYS INFO] Run script 'BB.WebJobs.exe' with script host - 'WindowsScriptHost'
    [01/10/2018 03:05:32 > 183d71: SYS INFO] Status changed to Running
    [01/10/2018 03:05:36 > 183d71: INFO] *** WebJob Main: start ***
    [01/10/2018 03:05:39 > 183d71: INFO] Found the following functions:

The same code that accesses a SQL database and request webpages has been running for months on-premises. I am in the process of moving the code to Azure an plans to have it running continuously in a WebJob. The code is kicked off at midnight and executes as expected (verified through database and log-files) until it suddenly is restarted.

This is seen in the log-file above in the line with timestamp: 01/10/2018 03:05:26.

Further inspecting the Kudu trace logs, I found this information:

<step title="Startup Request" date="2018-01-10T03:05:20.437" instance="183d71" url="/" method="GET" type="request" pid="1588,2,8" SCM_SKIP_SSL_VALIDATION="0" ScmType="None" >
  <step title="Cleanup Xml Logs" date="2018-01-10T03:05:20.453" /><!-- duration: 15ms -->
</step><!-- duration: 31ms -->

There are no errors in the eventlog.xml file. It simply seems that some Azure monitor decided to kill the process by sending a "Startup Request".

I have had the code running for two nights, and the exact same thing happens, though the time the event occurs is not the same (more than 30 min difference).

Question: is there some CPU usage limit that is hit? I know the process is not using more memory than available, in which case an exception would be caught and logged.

My WebJob is CPU greedy, but to my understanding the CPU is shared with the Web app and it is my responsibility to decide how the resources should be utilised - which I cannot do, when the WebJobs aborts like it does. I plan to have the WebJob running constantly and to wake up at night and work for 5-6 hours until done and then sleep the rest of the day.

I have searched for similar problems and this is not a case of "always on" not being specified nor that someone other than I restarts the Web app.

Any help is much appreciated!!

freilebt
  • 29
  • 3

1 Answers1

0

Azure occasionally moves Web Apps around at its discretion. To my knowledge, the exact reasons why haven't been made public. I have seen stuff like this in my logs, but it hasn't caused any service interruptions. To see if this is the case, do the following:

  1. Go to your Web App's management blade that's hosting your WebJob.
  2. Navigate to the Diagnose and Solve Problems blade.
  3. Choose Metrics per Instance (Apps).

On the Metrics per Instance (Apps) screen, you'll be able to see various metrics and performance counters per Web App instance. You can toggle these instances on and off to look at how they're performing in aggregate or individually.

Now, if the Web App hosting your WebJob was moved (thus triggering the restart you see), you should see more instances than you have provisioned and a discontinuity in the graph. For example, if you have a single instance provisioned but you see two or more instances in these graphs, your Web App has been moved. Note that the longest lookback period for these charts is 24 hours.

Please post back and let us know if this is what you see.

Rob Reagan
  • 7,313
  • 3
  • 20
  • 49
  • Thanks a lot Rob! This was helpful information. I didn't realise that these charts existed. Here is what I found. No Restart Events (AutoHeal events) are registered. It doesn't seem that multiple instances is the problem as you suggested. The CPU usage is as high as 81% during the time, where the worker is running. The memory usage is growing up to 83% at a peak, right at the time, where the restart occurs. The Committed Memory has a steep peak too, with 10784 Commited MBytes and the number of memory page operations has a sudden very high peak right at the time of restart. – freilebt Jan 11 '18 at 21:44
  • Hence, it looks like it is a memory problem after all. I just don't get that it doesn't result in a memory exception in the worker, that could be caught and logged. I'll make a reduction in memory used, rerun tonight, and see if this indeed was the problem. I'll keep you posted. – freilebt Jan 11 '18 at 21:47