
I have a long-running Java/Gradle process and an Azure Pipelines job to run it.

It's perfectly fine and expected for the process to run for several days, potentially over a week. The Azure Pipelines job is run on a self-hosted agent (to rule out any timeout issues) and the timeout is set to 0, which in theory means that the job can run forever.
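For reference, here's a minimal sketch of the kind of job configuration I mean (the pool name and Gradle invocation below are placeholders, not my actual setup):

```yaml
jobs:
- job: long_running_process
  pool:
    name: SelfHostedPool        # self-hosted agent pool (placeholder name)
  timeoutInMinutes: 0           # 0 = no job-level timeout on a self-hosted agent
  cancelTimeoutInMinutes: 5     # grace period the job gets if it is cancelled
  steps:
  - script: ./gradlew run      # the long-running Java/Gradle process (placeholder task)
    displayName: Run long-running process
```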

Sometimes the Azure Pipelines job fails after a day or two with an error message that says "We stopped hearing from agent". Even when this happens, the job itself may still be running, as I can confirm by SSH-ing into the machine that hosts the agent.

When I discuss investigating these failures with DevOps, I often hear that Azure Pipelines is a CI tool that is not designed for long-running jobs. Is there evidence to support this claim? Does Microsoft commit to supporting jobs only up to a certain duration?

Based on the troubleshooting guide and timeout documentation page referenced above, there's a duration limit applicable to Microsoft-hosted agents, but I fail to see anything similar for self-hosted agents.

Jura Gorohovsky
    Although it's *uncommon* to run jobs that take several days, it's not *impossible*. The problem you should be investigating is what's causing transient connection failures. This is something to take up with your network operations team and your internet service provider. You can also approach it by looking into other solutions for asynchronously running long-running tasks, or by looking at parallelizing whatever it is that you're running for days on end across multiple jobs. – Daniel Mann Dec 28 '22 at 18:47
  • I am curious @JuraGorohovsky - why would you want to use Azure Pipelines? is this a genuine need for deployment of software like a long running test or prep of some kind? I have never heard of a deployment needing something that big. – williamohara Dec 28 '22 at 21:02
  • @williamohara It's a data collection and preparation process that consumes multiple external APIs. Some of these APIs have strict usage limits, which makes parallelization almost impossible. As soon as data is collected, a separate, way faster job performs the actual build. – Jura Gorohovsky Dec 28 '22 at 22:09

1 Answer


Agree with @Daniel Mann.

It's not common to run such long-running jobs, but per the documentation, it should be supported.

"We stopped hearing from agent" can be caused by a network problem on the agent, or by an agent issue such as high CPU, storage, or RAM pressure. You can check the agent diagnostic logs to troubleshoot.
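If you're using YAML, you can also turn on verbose agent diagnostics for a run so that more detailed agent logs are uploaded with the job. A minimal sketch (Agent.Diagnostic and System.Debug are the documented switches; the surrounding layout is just illustrative):

```yaml
variables:
  Agent.Diagnostic: true   # collect additional agent/worker logs and upload them with the job logs
  System.Debug: true       # optional: verbose task logging as well
```

On the self-hosted machine itself, the agent also writes its own logs to the _diag folder under the agent installation directory, which is worth checking around the time the "We stopped hearing from agent" error appears.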


wade zhou - MSFT