
I have several tasks on Azure Batch which are stuck in the Running state, although the compute node knows nothing about them (they are not running there and no task folders are found). Any task manipulation in the GUI (Terminate, Delete, Show files on node) ends with "There was an error while terminating task t20171129-0010-03. The server returned '500 Internal Server Error'." This has happened several times on different pools / jobs / tasks.

Now I have checked the debug files on the node itself, and the issue seems to start with "failed to extend lease", after which the agent deletes the task from the node and then reports "aborting attempt to update task table without an active queue lease".

Is this something I can avoid, or is it just a bug in the Azure Batch service? What exactly is the "lease", and how often does it need to be renewed? (My Azure subscription does not include Technical Support.)

Interesting lines from log:

agent.task.lease■lease.py■_renew_lease_unsafe_async■106■1398■MainThread■139690855581440■extending lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
requests.packages.urllib3.connectionpool■connectionpool.py■_make_request■387■1398■Thread-1■139690661328640■"PUT /pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=XXX&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA HTTP/1.1" 404 221
azurestorage.helper.HTTPNotFoundError: 404 Client Error: The specified message does not exist. for url: https://watbl2prod1.queue.core.windows.net/pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=mU9501N4HHuDeRWuA7qMNni9M%2Fbb83OWLF8AW0%2B4nQE%3D&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA
agent.task.lease■lease.py■_renew_lease_unsafe_async■119■1398■MainThread■139690855581440■failed to extend lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
agent.task.manager■manager.py■handle_task_lease_extension_error_async■4713■1398■MainThread■139690855581440■deleting task pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0 because lease was lost
agent.task.manager■manager.py■_postprocess_execute_task_async■2255■1398■MainThread■139690855581440■updating row in task table for: pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0
agent.task.manager■manager.py■_update_tasktable_entity_async■1624■1398■MainThread■139690855581440■aborting attempt to update task table without an active queue lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0

Entire log: https://pastebin.com/fkqTRuBe

Marki555

1 Answer


Currently, Azure Batch tasks have a maximum total lifetime of 7 days from the time the task is submitted to the job, as noted here.

When this limit is reached, there are issues in the service that prevent the task state update from propagating, which is why the task appears stuck in Running. However, if you observe the state of the node where the task ran, it will return to idle (assuming no other tasks are scheduled to it or currently running on it).

You have a few options to avoid this situation. If your workload is amenable to scaling up, migrate to a more performant VM size so that each task completes within the time limit. Alternatively, scale the problem out (or out further) by distributing the computation or chunking it into smaller pieces and running them in an embarrassingly parallel fashion; this may help resolve your issue.
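As a rough illustration of the chunking approach, here is a minimal sketch using the azure-batch Python SDK. The account URL, key, job ID, and process_chunk.py script are placeholders, and the constraint values would need tuning to your workload (older SDK versions use base_url instead of batch_url):

    import datetime

    import azure.batch.models as batchmodels
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    # Placeholder account details -- substitute your own Batch account.
    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
    client = BatchServiceClient(
        credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
    )

    # Split the long-running work into many short tasks instead of one 7-day task.
    tasks = [
        batchmodels.TaskAddParameter(
            id="chunk-{:04d}".format(i),
            command_line="/bin/bash -c 'python3 process_chunk.py --chunk {}'".format(i),
            constraints=batchmodels.TaskConstraints(
                # End the task automatically if it runs longer than a day.
                max_wall_clock_time=datetime.timedelta(days=1),
                max_task_retry_count=2,
            ),
        )
        for i in range(100)
    ]

    # add_collection accepts up to 100 tasks per call.
    client.task.add_collection(job_id="job-1", value=tasks)

With this layout each chunk finishes well within the lifetime limit, and a chunk that gets stuck or lost can simply be resubmitted as a new task.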

The current behavior is not very user friendly. There are plans to increase this limit in the future.

fpark
  • If I set a time limit when creating a task (e.g. 1 day) and allow the task to restart, will that help, or will those task restarts accumulate to 7 days and fail anyway? My task runs many small "sub-tasks" itself, so it is fine to restart it and it will continue where it left off. – Marki555 Dec 06 '17 at 18:42
  • This will help if your logic knows how to checkpoint/restart as you suggest. You can't rely just on task retries (as that uses the same task), but if you create a new task for each restart, this will work. Each task is independent and there is no time limit on the job itself. You can automatically end the task using the max wallclock time task constraint if that suits your scenario. Further, you might be able to automate the entire thing using job recurrences (see the sketch after these comments). – fpark Dec 06 '17 at 20:41
  • As of today the lifetime is 180 days. – Evandro Pomatti Mar 11 '23 at 19:10
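Building on the comment above, here is a hedged sketch of the job-recurrence idea with the same azure-batch Python SDK. The schedule ID, pool ID, and resume_from_checkpoint.py script are hypothetical, and `client` is the BatchServiceClient constructed in the earlier sketch:

    import datetime

    import azure.batch.models as batchmodels

    # Placeholder IDs; the recurrence interval and wall-clock limit would be
    # tuned so each run finishes comfortably within the task lifetime limit.
    job_schedule = batchmodels.JobScheduleAddParameter(
        id="nightly-restart",
        schedule=batchmodels.Schedule(
            # Create a fresh job (and hence a fresh task) every 24 hours.
            recurrence_interval=datetime.timedelta(hours=24),
        ),
        job_specification=batchmodels.JobSpecification(
            pool_info=batchmodels.PoolInformation(pool_id="1a-python"),
            job_manager_task=batchmodels.JobManagerTask(
                id="resume-work",
                # Hypothetical command: load the last checkpoint and continue
                # where the previous run left off.
                command_line="/bin/bash -c 'python3 resume_from_checkpoint.py'",
                constraints=batchmodels.TaskConstraints(
                    max_wall_clock_time=datetime.timedelta(hours=23),
                ),
            ),
        ),
    )

    client.job_schedule.add(job_schedule)

Each recurrence creates a brand-new job and task, so the 7-day clock starts over on every run while the checkpointing logic carries the work forward.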