
I have a StackEnsemble model trained with the AutoML functionality of the Azure ML Workspace. I get the error below (CrashLoopBackOff) when I try to deploy it as a Webservice. I strongly suspect it has something to do with the model itself / the dependencies it needs: when I swap the model name in score.py to another one that is not a StackEnsemble (with scalers) but a plain XGBoost model, the service gets created without issues.

I have the following questions:

- How would I find out which models / algorithms are inside the StackEnsemble, so that I can build the container / dependencies list properly?
- Is there any way to find out what the actual error is, besides building my container locally and debugging it there? I tried to fetch the logs with service.get_logs() as per the documentation, but there is nothing useful there, just the last 5 lines, which do not point to any issue.

Please advise.

WebserviceException: Service deployment polling reached non-successful terminal state, current service state: Failed
Error:
{
  "code": "AciDeploymentFailed",
  "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: classifier-bwp-ls5923-v1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image mlws219f9669.azurecr.io/classifier-bwp-ls5923-v1:4 locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information.",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: classifier-bwp-ls5923-v1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image mlws219f9669.azurecr.io/classifier-bwp-ls5923-v1:4 locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information."
    }
  ]
}
desertnaut
damucka

1 Answer


I'm not sure how to get the models being used in the Ensemble, but there are a few other things you can try to mitigate yourself in the meantime.
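That said, one hedged approach for the first question: AutoML fitted models generally follow the sklearn convention of nested `steps` / `estimators` attributes, so you can walk the object tree and collect the estimator class names. The stand-in classes below only simulate that shape; with a real fitted model you would call `list_estimators(fitted_model)` on the pipeline AutoML returns (attribute names may vary by SDK version).

```python
def list_estimators(model, found=None):
    """Recursively collect class names of nested estimators."""
    if found is None:
        found = []
    found.append(type(model).__name__)
    # sklearn Pipelines expose `steps` as (name, estimator) pairs
    for _, step in getattr(model, "steps", []):
        list_estimators(step, found)
    # ensembles expose their base models via `estimators`
    for item in getattr(model, "estimators", []):
        est = item[1] if isinstance(item, tuple) else item
        list_estimators(est, found)
    return found

# Stand-in objects mimicking the shape of a StackEnsemble pipeline
class StandardScaler: pass
class XGBoostClassifier: pass
class LightGBMClassifier: pass

class StackEnsemble:
    def __init__(self):
        self.estimators = [XGBoostClassifier(), LightGBMClassifier()]

class Pipeline:
    def __init__(self):
        self.steps = [("scaler", StandardScaler()),
                      ("ensemble", StackEnsemble())]

print(list_estimators(Pipeline()))
# -> ['Pipeline', 'StandardScaler', 'StackEnsemble',
#     'XGBoostClassifier', 'LightGBMClassifier']
```

Once you know which estimators are inside, you can add their packages (e.g. xgboost, lightgbm) to the environment explicitly.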

When your service is stuck in a CrashLoopBackOff, it keeps rebooting, which means the logs keep getting wiped, since they are stored on the container itself. A quick fix is to run the get_logs() function several times to catch all of what's happening.
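The repeated polling can be sketched with a small stdlib helper. The `get_logs` callable here is a stand-in; with the real SDK you would pass `service.get_logs`:

```python
import time

def collect_logs(get_logs, attempts=5, interval=2.0):
    """Poll get_logs repeatedly and keep every distinct chunk seen,
    since a crash-looping container wipes its logs on each restart."""
    seen = []
    for _ in range(attempts):
        chunk = get_logs()
        if chunk and chunk not in seen:
            seen.append(chunk)
        time.sleep(interval)
    return "\n--- restart ---\n".join(seen)

# With the real SDK: print(collect_logs(service.get_logs))
```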

To get historical information, make sure that App Insights is enabled for the deployment (the appInsightsEnabled setting) so that you can track logs in the Application Insights instance attached to your workspace.
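As a sketch (azureml-core; the parameter name is based on the AciWebservice API, so verify it against your SDK version), this can be flipped on for an existing service without redeploying:

```python
# Enable Application Insights so logs survive container restarts
service.update(enable_app_insights=True)
```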

Other than dependency mismatches, the most common cause of CrashLoopBackOff is the service not being given enough memory to actually load and score against the model. Try increasing the memory reservation for the service.
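A deploy-time sketch, assuming an ACI deployment via azureml-core (the default memory reservation is small, which is often too little for an AutoML StackEnsemble; exact defaults and parameter names depend on your SDK version):

```python
from azureml.core.webservice import AciWebservice

# Reserve more memory for the container than the ACI default
deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4)
```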

chpirill
  • The issue was that I had to add the pip package azureml-sdk[automl] to the myenv.yml, which was not there before. Thank you for the hints about appInsightsEnabled and memory (this is important, because I aim to deploy several webservices on that container, so I would like to save costs by giving each of them relatively low CPU/memory) and about the periodic printing of get_logs(). – damucka Aug 09 '19 at 08:10
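The fix from the comment corresponds to adding the AutoML extras to the pip section of the conda environment file. The file and package names below follow the comment; the environment name, Python version, and azureml-defaults entry are illustrative:

```yaml
# myenv.yml -- conda environment used to build the scoring image
name: project_environment
dependencies:
  - python=3.6
  - pip:
      # azureml-sdk[automl] pulls in the ensemble/scaler dependencies
      # that a StackEnsemble model needs at scoring time
      - azureml-sdk[automl]
      - azureml-defaults
```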