I have an AWS EC2 p3.2xlarge instance. I can SSH into it without trouble. However, after about 20 minutes of running a Keras model on it, the connection is reset and I am kicked out with the error `Connection reset by 54.161.50.138 port 22`. I can then reconnect, but I have to start training the model over again because my progress is lost. This happens every time I connect to the instance. Any idea why this is happening?

For SSH I am using Gow, which lets me run Linux commands on Windows: https://github.com/bmatzelle/gow/wiki. I checked my public IP address before and after the reset, and it was the same. I also looked at the CPU usage in Amazon CloudWatch, and it was normal (20%).

  • See [this post](https://stackoverflow.com/questions/25084288/keep-ssh-session-alive), but you need something that keeps running without requiring a connection the whole time. Can you run your program in the background? – stdunbar Jan 05 '19 at 22:46
  • Are you on a home Internet connection? If so, check your public IP address (https://www.whatismyip.com/), then check it again after the reset. Two more things to check: your ISP may be blocking long-lived connections, and your home router may be having problems, so make sure its firmware is up to date. What software are you using for ssh? Is your ssh client using Keep-Alives? Edit your question with more information, otherwise we can only guess. – John Hanley Jan 05 '19 at 22:52
  • Thanks for your response. I don't think it is a problem about keeping the connection alive because it resets randomly from the other end. Also do you have any advice on how to run the program in the background? I am just running a .py file from the remote server terminal. – johnsmith13579 Jan 05 '19 at 22:54
  • I cannot answer about running TensorFlow in the background. I use a TensorFlow container and then connect to the container with a browser. This way I don't have to worry about setup, connections, background processes, etc. – John Hanley Jan 05 '19 at 23:46
  • If you run tmux or similar on your p2 you can detach from the remote session and let the session run then reattach without having your ssh tunnel failure stop the training. – Matthew Arthur Jan 06 '19 at 10:36
  • Question has nothing to do with `keras` - kindly do not spam the tag (removed). – desertnaut Jan 12 '19 at 14:13

1 Answer

I figured out a partial solution to this. It does not explain why the connection resets, but it keeps the job running when it does. In the instance terminal, do the following:

  1. Run the command `tmux`.
  2. In the new shell that opens, start the job.
  3. Detach from the tmux session with the shortcut Ctrl+b, then d.
  4. If the SSH connection resets, SSH into the instance again and run `tmux attach`.
  5. The job will have kept running, so you can pick up where you left off.
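
The same steps can also be wrapped in two small shell functions, so the job starts in a named, detached session and you never have to press the detach shortcut. This is only a sketch under assumptions: the session name `training` and the command `python train.py` are placeholders that do not come from the question, and the helper names `start_job` and `resume_job` are hypothetical.

```shell
#!/bin/sh
# Placeholders (assumptions, not from the question): change to match your setup.
SESSION="training"
JOB="python train.py"

# Start the job inside a new tmux session without attaching to it.
#   -d  create the session detached
#   -s  give the session a name so it can be found again later
start_job() {
    tmux new-session -d -s "$SESSION" "$JOB"
}

# After reconnecting over SSH, reattach to the named session.
#   -t  target session to attach to
resume_job() {
    tmux attach-session -t "$SESSION"
}
```

Run `start_job` once after logging in; if the connection resets, SSH back in and run `resume_job`. `tmux ls` shows which sessions still exist. Note that a session started with a command closes when that command exits, so redirect the job's output to a file if you want to read it afterwards.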