
I am trying to set up a multi-model endpoint (or, more accurately, re-set it up, as I am pretty sure it was working a while ago on an earlier version of SageMaker) to do language translation, but I am constantly met with the same issue. This is what I am trying to run (from a notebook on SageMaker):

import sagemaker
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import JSONSerializer, JSONDeserializer

role = 'role_name...'

# Model object pointing at the packaged MarianMT weights and the inference code
pytorch_model = PyTorchModel(model_data='s3://foreign-language-models/opus-mt-ROMANCE-en.tar.gz',
                             role=role,
                             framework_version="1.3.1",
                             py_version="py3",
                             source_dir="code",
                             entry_point="deploy_multi_model.py")

# Attach a predictor to the existing endpoint and send/receive JSON
x = pytorch_model.predictor_cls(endpoint_name='language-translation')
x.serializer = JSONSerializer()
x.deserializer = JSONDeserializer()

x.predict({'model_name': 'opus-mt-ROMANCE-en', 'text': ["Hola que tal?"]})

To which I am met with the error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Worker died."
}

And when I investigate the logs the error links to, the only notable entry says:

epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

But I cannot figure out why this is happening. Any help would be greatly appreciated as this is currently driving me insane! And if you need any more information from me to help, don't hesitate to ask.


1 Answer


This question is old and may no longer need an answer; however, the way to proceed when faced with these problems is straightforward:

"Worker died" is the generic message the endpoint returns for an internal server error; it does not tell you the specific cause. You should open the full endpoint logs in CloudWatch and see where it broke down, because the underlying cause can be almost anything.

To debug the problem, reach the logs directly by clicking on "View logs" on the endpoint's details page in the SageMaker console, or by going to CloudWatch at the path:

CloudWatch / Log groups / /aws/sagemaker/Endpoints/<your_endpoint_name>

(For a training job, the equivalent log group is /aws/sagemaker/TrainingJobs/<your_job_name>.)
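If you prefer to pull those logs from the notebook itself, a minimal sketch using boto3 could look like the following (it assumes the endpoint name language-translation from the question, from which the log group name is derived):

import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/Endpoints/language-translation"  # endpoint name from the question

# Grab the most recently active log stream for the endpoint
streams = logs.describe_log_streams(logGroupName=log_group,
                                    orderBy="LastEventTime",
                                    descending=True,
                                    limit=1)["logStreams"]

# Print the latest events; the stack trace that killed the worker usually
# appears just before the "Worker disconnected" line
if streams:
    events = logs.get_log_events(logGroupName=log_group,
                                 logStreamName=streams[0]["logStreamName"],
                                 limit=100)["events"]
    for event in events:
        print(event["message"])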

If the problem occurs as soon as you invoke the endpoint, it is often related to the model data not being loaded correctly (for example, the model_fn in your inference script failing) or to the inference payload being passed in a format your input_fn does not expect; a sketch of those hooks follows below.
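The entry point from the question (deploy_multi_model.py) is not shown, so purely as an illustration of where such failures usually originate, here is a minimal sketch of the inference hooks the SageMaker PyTorch serving container calls, assuming a Hugging Face MarianMT model such as opus-mt-ROMANCE-en (the transformers dependency is an assumption, not something confirmed by the question). If model_fn raises, the worker dies at load time and you get exactly the generic 500 shown above:

import json
from transformers import MarianMTModel, MarianTokenizer  # assumed dependency

def model_fn(model_dir):
    # model_dir contains the extracted model_data tarball; a wrong tarball layout
    # makes from_pretrained fail here and kills the worker at load time
    tokenizer = MarianTokenizer.from_pretrained(model_dir)
    model = MarianMTModel.from_pretrained(model_dir)
    return {"tokenizer": tokenizer, "model": model}

def input_fn(request_body, request_content_type):
    # Must match what the JSONSerializer on the predictor sends
    return json.loads(request_body)

def predict_fn(data, artifacts):
    tokenizer, model = artifacts["tokenizer"], artifacts["model"]
    batch = tokenizer(data["text"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def output_fn(prediction, accept):
    return json.dumps(prediction)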
