
I am trying to set up a multi-model endpoint (or, more accurately, re-set it up, as I am pretty sure it was working a while ago on an earlier version of SageMaker) to do language translation, but I am constantly met with the same issue. This is what I am trying to run (from a notebook on SageMaker):

import sagemaker
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import JSONSerializer, JSONDeserializer

role = 'role_name...'

# Model object pointing at the packaged MarianMT weights and the inference code
pytorch_model = PyTorchModel(model_data='s3://foreign-language-models/opus-mt-ROMANCE-en.tar.gz',
                             role=role,
                             framework_version="1.3.1",
                             py_version="py3",
                             source_dir="code",
                             entry_point="deploy_multi_model.py")

# Attach a predictor to the existing endpoint and send/receive JSON
x = pytorch_model.predictor_cls(endpoint_name='language-translation')
x.serializer = JSONSerializer()
x.deserializer = JSONDeserializer()

x.predict({'model_name': 'opus-mt-ROMANCE-en', 'text': ["Hola que tal?"]})

To which I am met with the error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Worker died."
}

And when I investigate the logs the error links to, the only notable entry says:

epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

But I cannot figure out why this is happening. Any help would be greatly appreciated as this is currently driving me insane! And if you need any more information from me to help, don't hesitate to ask.


1 Answer


This question is old and may no longer need an answer; however, the way to proceed when faced with these problems is straightforward:

"Worker died" is the generic message the endpoint returns for an internal server error; it does not tell you the specific cause. You should open the full endpoint logs in CloudWatch and see where it broke down, because the underlying cause can be almost anything.

To debug the problem, reach the logs directly by clicking on "View logs" on the endpoint's details page in the SageMaker console, or by going to CloudWatch at the path:

CloudWatch / Log groups / /aws/sagemaker/Endpoints/<your_endpoint_name>

(For a training job, the equivalent log group is /aws/sagemaker/TrainingJobs/<your_job_name>.)
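If you prefer to pull those logs from the notebook itself, a minimal sketch using boto3 could look like the following (it assumes the endpoint name language-translation from the question, from which the log group name is derived):

import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/Endpoints/language-translation"  # endpoint name from the question

# Grab the most recently active log stream for the endpoint
streams = logs.describe_log_streams(logGroupName=log_group,
                                    orderBy="LastEventTime",
                                    descending=True,
                                    limit=1)["logStreams"]

# Print the latest events; the stack trace that killed the worker usually
# appears just before the "Worker disconnected" line
if streams:
    events = logs.get_log_events(logGroupName=log_group,
                                 logStreamName=streams[0]["logStreamName"],
                                 limit=100)["events"]
    for event in events:
        print(event["message"])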

If the problem occurs as soon as you invoke the endpoint, it is often related to the model data not being loaded correctly (for example, the model_fn in your inference script failing) or to the inference payload being passed in a format your input_fn does not expect; a sketch of those hooks follows below.
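The entry point from the question (deploy_multi_model.py) is not shown, so purely as an illustration of where such failures usually originate, here is a minimal sketch of the inference hooks the SageMaker PyTorch serving container calls, assuming a Hugging Face MarianMT model such as opus-mt-ROMANCE-en (the transformers dependency is an assumption, not something confirmed by the question). If model_fn raises, the worker dies at load time and you get exactly the generic 500 shown above:

import json
from transformers import MarianMTModel, MarianTokenizer  # assumed dependency

def model_fn(model_dir):
    # model_dir contains the extracted model_data tarball; a wrong tarball layout
    # makes from_pretrained fail here and kills the worker at load time
    tokenizer = MarianTokenizer.from_pretrained(model_dir)
    model = MarianMTModel.from_pretrained(model_dir)
    return {"tokenizer": tokenizer, "model": model}

def input_fn(request_body, request_content_type):
    # Must match what the JSONSerializer on the predictor sends
    return json.loads(request_body)

def predict_fn(data, artifacts):
    tokenizer, model = artifacts["tokenizer"], artifacts["model"]
    batch = tokenizer(data["text"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def output_fn(prediction, accept):
    return json.dumps(prediction)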
