I have set up an Azure ML pipeline with a single Python script step. Provided that the compute has already been spun up, the pipeline initially took around 2 minutes, with an executionlogs.txt
that looked like this:
[2020-09-23 22:36:14Z] Experiment: <EXPERIMENT>
[2020-09-23 22:36:14Z] Run Id: <RUN_ID>
[2020-09-23 22:36:14Z] Run target: <RUN_TARGET>
[2020-09-23 22:36:14Z] Starting run in Execution Service
[2020-09-23 22:36:17Z] RunId:[<RUN_ID>] ParentRunId:[<PARENT_RUN_ID>] ComputeTarget:[AmlCompute]
[2020-09-23 22:36:19Z] Job is running, job runstatus is Queued
[2020-09-23 22:36:30Z] Job is running, job runstatus is Running
[2020-09-23 22:38:19Z] Job is running, job runstatus is Running
[2020-09-23 22:38:31Z] Job is running, job runstatus is Finalizing
[2020-09-23 22:38:40Z] Job finished, job RunId is <RUN_ID>
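For context, the pipeline is defined roughly as follows. This is a minimal sketch of my setup, not the exact code; the compute target name, script name, source directory, and experiment name are placeholders:

from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Existing, already-provisioned AmlCompute cluster (placeholder name)
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")

run_config = RunConfiguration()
run_config.target = compute_target

# The single Python script step in the pipeline (placeholder script/paths)
step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="./src",
    compute_target=compute_target,
    runconfig=run_config,
    allow_reuse=False,  # force the step to re-run so timings are comparable
)

pipeline = Pipeline(workspace=ws, steps=[step])
run = Experiment(ws, "my-experiment").submit(pipeline)
run.wait_for_completion(show_output=True)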
Now, with the same spun-up compute and unchanged code, the run time is roughly three times longer:
[2020-10-14 17:14:23Z] Experiment: <EXPERIMENT>
[2020-10-14 17:14:23Z] Run Id: <RUN_ID>
[2020-10-14 17:14:23Z] Run target: <RUN_TARGET>
[2020-10-14 17:14:23Z] Starting run in Execution Service
[2020-10-14 17:14:26Z] RunId:[<RUN_ID>] ParentRunId:[<PARENT_RUN_ID>] ComputeTarget:[AmlCompute]
[2020-10-14 17:14:31Z] Job is running, job runstatus is Queued
[2020-10-14 17:14:40Z] Job is running, job runstatus is Running
[2020-10-14 17:16:26Z] Job is running, job runstatus is Running
[2020-10-14 17:18:27Z] Job is running, job runstatus is Running
[2020-10-14 17:19:08Z] Job is running, job runstatus is Finalizing
[2020-10-14 17:20:28Z] Job is running, job runstatus is Finalizing
[2020-10-14 17:21:10Z] Job finished, job RunId is <RUN_ID>
Even the transition from Finalizing to finished now takes over 2 minutes! The run times reported in 70_driver_log.txt
are similar for both runs.
I identified a difference in the 55_azureml-execution-tvmps_....txt
log. The slow runs contain the following line related to Application Insights:
appinsightlogger.go:42: Time Out after 20 second retries for flushing the logs, doing another retry before exiting
How can I fix this?