AMLS Experiment run stuck in status "Running"

Question

I made an Azure Machine Learning Service Experiment run and logged neural network losses with Jupyter Notebook. Logging worked fine and NN training completed as it should. However, the experiment is stuck in the running status. Shutting down the compute resources does not shut down the Experiment run and I cannot cancel it from the Experiment panel. In addition, the run does not have any log-files.

Has anyone seen the same behavior? The run has now lasted over 24 hours.

score 6 · Accepted Answer · edited May 03 '23 at 10:31

This happens from time to time, and it is frustrating, especially because the "Cancel" button is grayed out. You can cancel the run with either the CLI or the Python SDK.

SDK

>= 1.16.0

As of version 1.16.0, an Experiment object is no longer needed. Instead, you can fetch the run directly through the Workspace object or the Run class:

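from azureml.core import Workspace, Run, VERSION
print("SDK version:", VERSION)

# Load the workspace from the local config.json
ws = Workspace.from_config()

# Either of these returns the same run; use whichever you prefer
run = ws.get_run('YOUR_RUN_ID')
run = Run.get(ws, 'YOUR_RUN_ID') # also works (static method, so no Run() instance)
run.cancel()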

< 1.16.0

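from azureml.core import Workspace, Experiment, Run, VERSION
print("SDK version:", VERSION)

# Load the workspace from the local config.json
ws = Workspace.from_config()
# Older SDK versions need the parent experiment to look up a run
exp = Experiment(workspace=ws, name='YOUR_EXP_NAME')

run = Run(exp, run_id='YOUR STEP RUN ID')

run.cancel() # or run.fail() to mark the run as failed instead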
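
If cancel() does not seem to take effect, it can help to check the status afterwards and fall back to run.fail(). Below is a minimal sketch of that idea, not an official recipe. The helper name stop_run, the timeout values, and the set of statuses treated as terminal are my own assumptions; it only relies on the run object's cancel(), fail(), and get_status() methods.

```python
import time

# Statuses treated as "finished" in this sketch (an assumption, adjust as needed)
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def stop_run(run, timeout_s=60.0, poll_s=5.0, sleep=time.sleep):
    """Hypothetical helper: cancel `run`, then fall back to run.fail().

    `run` is any object with cancel(), fail() and get_status(),
    for example an azureml.core.Run fetched as shown above.
    Returns the last status that was observed.
    """
    status = run.get_status()
    if status in TERMINAL_STATES:
        return status  # nothing to do, the run has already finished

    run.cancel()

    # Poll until the cancellation is reflected or the timeout expires
    waited = 0.0
    while waited < timeout_s:
        status = run.get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_s)
        waited += poll_s

    # Cancellation did not take effect in time: mark the run as failed
    run.fail()
    return run.get_status()
```

Usage would be something like stop_run(ws.get_run('YOUR_RUN_ID')). The sleep parameter exists only so the waiting can be swapped out, for example in a quick test.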

CLI

More CLI details here

az login
az ml run cancel --run YOUR_RUN_ID

Updated CLI command on May 5th, 2023:

az ml job cancel --name YOUR_JOB_NAME --resource-group YOUR_RG --workspace-name YOUR_WS

I tried the SDK, and it didn't take. Usually the SDK works, but this experiment was submitted by service principal. CLI doesn't work either. Any suggestions? — yeamusic21, Dec 21 '20 at 20:46
