I have several tasks on Azure Batch that are stuck in the Running state, although the node itself knows nothing about them (they are not running there, and no task folders exist). Any task manipulation in the GUI (Terminate, Delete, Show files on node) ends with: "There was an error while terminating task t20171129-0010-03. The server returned '500 Internal Server Error'."
This has happened several times on different pools, jobs, and tasks.
I have now checked the debug logs on the node itself. The issue seems to start with "failed to extend lease", after which the agent deletes the task from the node but then logs "aborting attempt to update task table without an active queue lease".
Is this something I can avoid, or is it a bug in the Azure Batch service? What exactly is the "lease", and how often does it need to be renewed? (My Azure subscription does not include Technical Support.)
Relevant lines from the log:
agent.task.lease■lease.py■_renew_lease_unsafe_async■106■1398■MainThread■139690855581440■extending lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
requests.packages.urllib3.connectionpool■connectionpool.py■_make_request■387■1398■Thread-1■139690661328640■"PUT /pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=XXX&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA HTTP/1.1" 404 221
azurestorage.helper.HTTPNotFoundError: 404 Client Error: The specified message does not exist. for url: https://watbl2prod1.queue.core.windows.net/pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=mU9501N4HHuDeRWuA7qMNni9M%2Fbb83OWLF8AW0%2B4nQE%3D&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA
agent.task.lease■lease.py■_renew_lease_unsafe_async■119■1398■MainThread■139690855581440■failed to extend lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
agent.task.manager■manager.py■handle_task_lease_extension_error_async■4713■1398■MainThread■139690855581440■deleting task pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0 because lease was lost
agent.task.manager■manager.py■_postprocess_execute_task_async■2255■1398■MainThread■139690855581440■updating row in task table for: pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0
agent.task.manager■manager.py■_update_tasktable_entity_async■1624■1398■MainThread■139690855581440■aborting attempt to update task table without an active queue lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0
Entire log: https://pastebin.com/fkqTRuBe
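
For reference, the failing PUT has the shape of an Azure Queue storage "Update Message" call (a `popreceipt` plus a new `visibilitytimeout` on a specific message), so my working assumption is that the "lease" is simply the visibility timeout on the queue message that represents the task. That interpretation is a guess on my part and is not confirmed by anything in the log. A minimal sketch that pulls the relevant parameters out of the logged request (the helper name `describe_lease_request` is made up for illustration):

```python
from urllib.parse import urlsplit, parse_qs

# Request target copied from the connectionpool log line (signature shown as XXX there).
REQUEST = (
    "/pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/"
    "821bf60d-3ba5-43a1-9c3d-c7500758bfea"
    "?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=XXX"
    "&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA"
)


def describe_lease_request(request_target):
    """Hypothetical helper: split a logged request target into its parts."""
    parts = urlsplit(request_target)
    query = parse_qs(parts.query)  # percent-decodes the values
    # Path layout is /<queue>/messages/<message-id>
    queue, _, message_id = parts.path.strip("/").partition("/messages/")
    return {
        "queue": queue,
        "message_id": message_id,
        # Assumption: this is the lease length the agent asks for, in seconds.
        "visibility_timeout_s": int(query["visibilitytimeout"][0]),
        "pop_receipt": query["popreceipt"][0],
        "sas_expiry": query["se"][0],
    }


info = describe_lease_request(REQUEST)
for key, value in info.items():
    print(f"{key}: {value}")
```

If that reading is right, each renewal asks for 360 seconds of visibility, and the 404 "The specified message does not exist" means the message was already gone, or the pop receipt was no longer valid, by the time the agent tried to renew.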