Details

I am currently training YOLOv5 on my own dataset on AWS SageMaker. I estimate that training will take more than 8 hours (depending on the number of epochs), which extends beyond my working hours. I have set up a lifecycle configuration that shuts the server down if it is idle for 10 minutes, using this script (I have a question about this too).

Objective and Issue

Hence, I plan to train the model all night. By "all night", I mean that I start the training in the cloud, close the browser, shut down my laptop, and come back to work the next day to collect the results. I am monitoring the training with ClearML.

The problem is that I cannot close the SageMaker browser tab and let the training keep running. I have tried two approaches, but both failed.

What I have done

  1. Run in a notebook.
  • Issue: closing the JupyterLab browser tab stops the training process.
  2. Run in a terminal with tmux.
  • I created a tmux session, started the training, then detached from it.
  • I closed the JupyterLab browser tab and the training kept running, so closing the browser is no longer a problem.
  • However, after several epochs, ClearML showed that the training had stopped. The server had shut down, presumably because SageMaker considered it idle (maybe because nothing was running in a notebook).
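
For reference, the tmux step can be approximated from Python as well. This is only a minimal sketch of launching a command in its own session with output redirected to a log file; `launch_detached` and the log path are hypothetical names, not part of YOLOv5 or SageMaker. Like tmux, it only keeps the process alive after the browser or terminal closes. It does not stop the lifecycle script from shutting the instance down.

```python
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start cmd in a new session so it survives the launching terminal closing.

    Hypothetical helper: output (stdout and stderr) is appended to log_path.
    """
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input needed
            stdout=log_file,            # keep a log to inspect the next day
            stderr=subprocess.STDOUT,   # merge errors into the same log
            start_new_session=True,     # detach from the terminal's session (POSIX)
        )
    finally:
        # The child holds its own copy of the file descriptor.
        log_file.close()
    return proc
```

A call might look like `launch_detached([sys.executable, "train.py", "--epochs", "300"], "train.log")`, where the epoch count and log name are placeholders.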

Now I have no idea how to run it all night. Also, what does "idle" actually mean in SageMaker? Does it refer to the notebook or to the terminal?

I would appreciate any help. Thank you.

pynexj
curiouscheese
  • It's not recommended to use notebooks for long-running jobs. Have you tried Training jobs, or even Notebook jobs (https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html)? Idleness is defined here: https://github.com/jupyter/notebook/issues/4634. It depends on the kernel (notebook), not the terminal. – durga_sury Jun 02 '23 at 21:39
  • @durga_sury Based on the GitHub discussion about idleness, does running something (such as training a model) in the terminal not make it 'busy'? – curiouscheese Jun 04 '23 at 02:36
  • Yes, idleness applies to the kernels, not to the terminal itself. Also, notebooks have a default session duration of 12 hours, so anything running in the terminal for longer than that would be killed too. – durga_sury Jun 05 '23 at 20:28
  • @durga_sury I see. Thank you for the answer. If you don't mind answering, where do professional ML engineers and data scientists usually run longer training jobs, if not on AWS SageMaker? Is EC2 a good option, or are there other services? – curiouscheese Jun 06 '23 at 00:49
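
To illustrate the point made in the comments, here is a rough sketch of a kernel-based idle check. It assumes the lifecycle script works from a session list shaped like the one returned by Jupyter's `/api/sessions` endpoint, where each session carries a `kernel` with `execution_state` and `last_activity`. The function names and the exact logic are assumptions for illustration, not the actual script. A process started in a terminal or tmux does not appear in that list, so it cannot make the instance look busy.

```python
from datetime import datetime, timezone


def parse_activity(timestamp):
    """Parse a timestamp such as '2023-06-02T21:39:00.000000Z' as UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def instance_is_idle(sessions, now, idle_seconds):
    """Return True if no kernel is busy or was active within idle_seconds.

    `sessions` is assumed to be the parsed JSON list of notebook sessions.
    Terminal processes are not kernels, so they are never seen here.
    """
    for session in sessions:
        kernel = session["kernel"]
        if kernel["execution_state"] != "idle":
            return False  # a kernel is executing a cell right now
        elapsed = (now - parse_activity(kernel["last_activity"])).total_seconds()
        if elapsed < idle_seconds:
            return False  # a kernel was used recently
    return True  # also the result when only a terminal job is running
```

Under these assumptions, a 10-minute timeout corresponds to `idle_seconds=600`, and an instance with no open notebooks counts as idle even while a tmux session is training.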
