
I have been trying to train a regression model on a large dataset in AWS SageMaker.

The instance I used on my last try was ml.m5.12xlarge, and I was confident it would work this time, but no. I still get the error.

A few minutes into training, I get this error in CloudWatch:

[E 07:00:35.308 NotebookApp] KernelRestarter: restart callback <bound method ZMQChannelsHandler.on_kernel_restarted of ZMQChannelsHandler(f92aff37-be6b-48df-a5f5-522bcc6dd072)> failed
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/jupyter_client/restarter.py", line 86, in _fire_callbacks
    callback()
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 473, in on_kernel_restarted
    self._send_status_message('restarting')
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 469, in _send_status_message
    self.write_message(json.dumps(msg, default=date_default))
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/tornado/websocket.py", line 337, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

Does anyone know what could be causing this error?

Alejandro
  • How long was your notebook running? For training a model, the usual approach is to back your notebook in Studio with a small instance and hand the training off to a transient Training Job that runs on a more powerful instance. – Marc Karp Jul 05 '22 at 23:00
  • It takes about 10-15 minutes to train. I will keep this in mind, thanks! – Alejandro Jul 06 '22 at 13:56
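
For illustration, the approach suggested in the comment above (a small notebook instance that submits a separate Training Job) might look roughly like the sketch below. It only builds the request for the SageMaker CreateTrainingJob API and does not submit anything. The function name `build_training_job_request`, the job-name prefix, the `train` channel name, the 50 GB volume, the one-hour runtime limit, and every angle-bracket placeholder are assumptions made for this example and do not come from the question. Only the instance type is taken from the question.

```python
import time


def build_training_job_request(image_uri, role_arn, train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.12xlarge"):
    """Build keyword arguments for the SageMaker CreateTrainingJob API.

    All argument values are supplied by the caller; nothing here contacts AWS.
    """
    # Job names must be unique per account and region, so append a timestamp.
    job_name = "regression-training-" + time.strftime("%Y%m%d-%H%M%S")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container that runs the training code
            "TrainingInputMode": "File",     # copy the S3 data to local disk first
        },
        "RoleArn": role_arn,                 # IAM role the job assumes
        "InputDataConfig": [{
            "ChannelName": "train",          # assumed channel name
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,   # the large instance runs only for the job
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,            # assumed size; adjust to the dataset
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # assumed 1-hour cap
    }


# Placeholder values: replace each one with your own resources.
request = build_training_job_request(
    image_uri="<your-training-image-uri>",
    role_arn="<your-sagemaker-execution-role-arn>",
    train_s3_uri="s3://<your-bucket>/train/",
    output_s3_uri="s3://<your-bucket>/output/",
)

# To actually launch the job (needs AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

With this split, the notebook kernel only submits the request and can sit on a small instance. The training itself runs in the job's own container, so a notebook kernel restart would not interrupt it.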

0 Answers