1

I have set up an Azure ML pipeline with a single Python script step. Provided that the compute has already been spun up, initially the pipeline took around 2 minutes with an executionlogs.txt like so:

[2020-09-23 22:36:14Z] Experiment: <EXPERIMENT>
[2020-09-23 22:36:14Z] Run Id:     <RUN_ID>
[2020-09-23 22:36:14Z] Run target: <RUN_TARGET>
[2020-09-23 22:36:14Z] Starting run in Execution Service
[2020-09-23 22:36:17Z] RunId:[<RUN_ID>] ParentRunId:[<PARENT_RUN_ID>] ComputeTarget:[AmlCompute]
[2020-09-23 22:36:19Z] Job is running, job runstatus is Queued
[2020-09-23 22:36:30Z] Job is running, job runstatus is Running
[2020-09-23 22:38:19Z] Job is running, job runstatus is Running
[2020-09-23 22:38:31Z] Job is running, job runstatus is Finalizing
[2020-09-23 22:38:40Z] Job finished, job RunId is <RUN_ID>

Now, with the same spun-up compute and code, I'm finding a run time that is 3 times larger:

[2020-10-14 17:14:23Z] Experiment: <EXPERIMENT>
[2020-10-14 17:14:23Z] Run Id:     <RUN_ID>
[2020-10-14 17:14:23Z] Run target: <RUN_TARGET>
[2020-10-14 17:14:23Z] Starting run in Execution Service
[2020-10-14 17:14:26Z] RunId:[<RUN_ID>] ParentRunId:[<PARENT_RUN_ID>] ComputeTarget:[AmlCompute]
[2020-10-14 17:14:31Z] Job is running, job runstatus is Queued
[2020-10-14 17:14:40Z] Job is running, job runstatus is Running
[2020-10-14 17:16:26Z] Job is running, job runstatus is Running
[2020-10-14 17:18:27Z] Job is running, job runstatus is Running
[2020-10-14 17:19:08Z] Job is running, job runstatus is Finalizing
[2020-10-14 17:20:28Z] Job is running, job runstatus is Finalizing
[2020-10-14 17:21:10Z] Job finished, job RunId is <RUN_ID>

Even 'Finalize' to 'Finish' takes over 2 minutes now! The run times for 70_driver_log.txt are similar for both runs.

I identified a difference in the 55_azureml-execution-tvmps_....txt log. My slow runs contain the following line related to application insights:

appinsightlogger.go:42: Time Out after 20 second retries for flushing the logs, doing another retry before exiting

How can I fix this?

bab689
  • 199
  • 1
  • 12
  • Do you use some SDK? Or do you submit to REST endpoint directly? Are you blocking your script while submit data to Application Insights? – ZakiMa Oct 17 '20 at 04:52
  • This happens both when the pipeline is called using the Python SDK and when it is called from an ADF pipeline. My code does not contain custom logging--could there still be blocking from any default logging? – bab689 Oct 21 '20 at 19:38
  • Can you please point to Python SDK you use? – ZakiMa Oct 21 '20 at 22:05

0 Answers0