Details
I am currently training YOLOv5 on my own dataset on AWS SageMaker. I estimate that training will take more than 8 hours (depending on the number of epochs), which runs past my working hours. I have set a Lifecycle configuration script that shuts the server down if it is idle for 10 minutes (I have a question about this too).
Objective and Issue
So I planned to train the model all night. By "all night", I mean: I start the training on the cloud, close the browser, shut down my laptop, come back to the workplace the next day, and get the result. I am monitoring the training using ClearML.
The problem is that I cannot close the SageMaker browser tab and let the training keep running. I have tried two approaches, but both failed.
What I have done
- Run in a notebook.
  - Issue: Closing the JupyterLab browser tab stops the training process.
- Run in a terminal with `tmux`.
  - What I did: I created a session using `tmux`, started the training, then detached from the session.
  - I closed the JupyterLab browser tab, and it worked: the training kept running. So now I can close the browser.
  - However, after several epochs, ClearML showed that the training had stopped. The server had shut down, which I presume means SageMaker considered it idle (maybe because nothing was running in a notebook).
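
A minimal sketch of the `tmux` approach described above, written in Python so it is self-contained. The session name `yolo_train`, the dataset file `my_dataset.yaml`, and the epoch count are placeholder values for illustration, not my exact command.

```python
import subprocess


def tmux_detached_command(session_name, training_command):
    """Build a tmux command that runs training_command in a new detached session."""
    # -d: create the session without attaching to it
    # -s: give the session a name so it can be re-attached later
    # The final string is passed to tmux, which runs it in a shell.
    return ["tmux", "new-session", "-d", "-s", session_name, training_command]


def start_detached(session_name, training_command):
    """Actually launch the session (requires tmux to be installed on the instance)."""
    subprocess.run(tmux_detached_command(session_name, training_command), check=True)


if __name__ == "__main__":
    # Placeholder training command; adjust the script path and arguments as needed.
    cmd = tmux_detached_command(
        "yolo_train",
        "python train.py --data my_dataset.yaml --epochs 100",
    )
    print(cmd)
```

The session can later be re-attached from a terminal with `tmux attach -t yolo_train`. This keeps the process alive after the browser is closed, but it does nothing to stop the instance itself from being shut down.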
Now I have no idea how to run it all night. Also, what does "idle" actually mean in SageMaker? Does it refer to the notebook or to the terminal?
I would appreciate any help. Thank you.
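
To make the question concrete, here is a sketch of how I imagine an idle check might work. It is an assumption, not the actual logic of my Lifecycle script: it supposes the check only inspects notebook sessions, in the shape returned by the Jupyter Server `/api/sessions` endpoint (each session has a `kernel` with `execution_state` and `last_activity`). The function name `notebook_is_idle` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_activity(timestamp):
    """Parse a Jupyter-style UTC timestamp such as 2024-01-01T11:00:00.000000Z."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def notebook_is_idle(sessions, now, idle_limit):
    """Return True if no notebook kernel is busy or was recently active.

    Assumption: only notebook sessions are inspected. Processes started from
    a terminal (for example inside tmux) do not appear in `sessions` at all.
    """
    for session in sessions:
        kernel = session["kernel"]
        if kernel["execution_state"] == "busy":
            return False
        if now - parse_activity(kernel["last_activity"]) < idle_limit:
            return False
    return True


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # No open notebook sessions, as when training runs only in a terminal.
    print(notebook_is_idle([], now, timedelta(minutes=10)))
```

If the real check behaves like this, a training job running only in a terminal would look idle, which would match what I observed.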