Been using a GCP preemptible VM for a few months without problems, but in the last 4 weeks my instances have consistently shut off anywhere from 10 minutes to 20 minutes into operation.

I'll be in the middle of training, and my notebook will suddenly disconnect. The terminal will show this error:

jupyter@fastai-instance:~$ Connection to 104.154.142.171 closed by remote host.
Connection to 104.154.142.171 closed.
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

I then check the status of my VM and see that it has shut down.

I searched the terminal traceback and found this thread, which seemed promising: ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]

When I ran sudo gcloud compute config-ssh, my VM ran for much longer than usual before shutting down, yet it shut down in the same way after about an hour. Since then, it has gone back to the same behavior.

I know preemptible instances can be shut down when the platform needs resources, but my understanding is that this comes with some kind of warning. I've checked the status of GCP's servers after shutdowns and they appear to be fine. This is also happening the same way every time I turn my VM on, which seems too frequent for preemption.

I am not sure where to look for any clues – has anyone else had a problem like this? What's especially puzzling to me is, if it is in fact an SSH problem, why would that cause the VM itself to shut down, rather than just break the connection?

Thanks very much for any help!

buttchurch
  • Preemptible VMs may not be the best choice for interactive (e.g. shell) workloads -- generally they are useful for large batch processes that can checkpoint their work easily. – robsiemb Oct 06 '19 at 15:03
  • `gcloud compute config-ssh` is not related. That command populates `~/.ssh/config` with details on your compute instances to make ssh connections easier for humans. – John Hanley Oct 06 '19 at 15:46
  • @JohnHanley I see – thanks for the clarification. I've been just following the instructions on ssh commands without really understanding how they work or what is going on under the hood, so thanks for your help. – buttchurch Oct 07 '19 at 13:03

3 Answers

Have you tried setting a shutdown script that writes the state of the VM to a file when it goes down?

Try this as the shutdown script:

#!/bin/bash

# Ask the metadata server whether this instance was preempted and save the answer.
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google" > /tmp/preempted.log

If the file contains TRUE, the VM was preempted.
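A slightly longer variant of the same idea is sketched below. It appends a timestamped line on each shutdown instead of overwriting the file, so repeated shutdowns can be compared. The log path `/var/tmp/preempted.log` and the function names are assumptions made for illustration, not part of the original answer; some distributions clear `/tmp` at boot, which is why a different default is used here.

```shell
#!/bin/bash
# Sketch of a shutdown script that records why the VM went down.
# LOG_FILE is an assumed location; /var/tmp normally survives reboots.
LOG_FILE="${LOG_FILE:-/var/tmp/preempted.log}"
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/preempted"

# Map the metadata value (TRUE or FALSE) to a readable reason.
describe_shutdown() {
  case "$1" in
    TRUE)  echo "preempted" ;;
    FALSE) echo "not preempted" ;;
    *)     echo "unknown" ;;
  esac
}

# Query the metadata server and append a timestamped line to the log.
log_shutdown_reason() {
  value="$(curl -s --max-time 5 -H "Metadata-Flavor: Google" "$METADATA_URL")"
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(describe_shutdown "$value")" >> "$LOG_FILE"
}

log_shutdown_reason || echo "could not write $LOG_FILE" >&2
```

Assuming the script is saved as `shutdown.sh`, it could be attached to an instance with something like `gcloud compute instances add-metadata fastai-instance --metadata-from-file shutdown-script=shutdown.sh` (the instance name is taken from the prompt in the question and may differ).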

guillaume blaquiere
  • Thanks! I will try this and see if it is in fact preempting :) I'll report back tomorrow! – buttchurch Oct 07 '19 at 12:55
  • Your link was super helpful! It led me to the monitoring logs of the instance, which showed that all of the dozens of times it has shut down were indeed the VM being preempted. Thanks for your help :) – buttchurch Oct 08 '19 at 09:33
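Preemption events can also be listed from the command line. The sketch below wraps `gcloud compute operations list` with a filter on the `compute.instances.preempted` operation type. The helper function names are hypothetical, and the `targetLink` clause for narrowing to one instance is an assumption based on the filter syntax shown in the preemptible-instance documentation.

```shell
#!/bin/bash
# Build the filter expression; pass an instance name to narrow the results.
preemption_filter() {
  if [ -n "$1" ]; then
    echo "operationType=compute.instances.preempted AND targetLink:instances/$1"
  else
    echo "operationType=compute.instances.preempted"
  fi
}

# List preemption operations for the current project (requires gcloud).
list_preemptions() {
  gcloud compute operations list --filter="$(preemption_filter "$1")"
}

# Example (instance name assumed from the question's prompt):
#   list_preemptions fastai-instance
```

Each row in the output corresponds to one preemption, so a long list with closely spaced timestamps would match the behavior described in the question.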

If a VM stops while you have an active SSH connection to it (via gcloud compute ssh), it is normal to receive an error. When the VM goes down, all of its connections are closed, including your SSH session (you cannot connect to a stopped instance). The VM termination causes the SSH error, not the other way around.

When using preemptible instances, Google can reclaim the instance whenever it needs to. Note (from the documentation on the limitations of preemptible instances):

Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.

This means that one day your instance may run for 24 hours without being terminated, but another day it may be stopped 30 minutes after starting if Compute Engine needs to reclaim resources.

norbjd
  • Thanks for the link to the documentation – I see some instructions in there on how to verify whether my instance has been preempting. That's a good first step to try! – buttchurch Oct 07 '19 at 13:00
  • I looked and you're correct – the SSH connection was being severed because my VM was being preempted. I have a follow-up question for you: this VM was not preempted at all for the first month or two I used it, and it has been preempted consistently for the past few weeks now in the same way. To put it simply, it just never stays on for more than 30 minutes at a time. Should I switch regions or something? Any idea what would cause that change? Thanks very much. – buttchurch Oct 08 '19 at 09:32
  • [From the docs about preemption selection](https://cloud.google.com/compute/docs/instances/preemptible#preemption_selection) : *For reference, we've observed from historical data that the average preemption rate varies between 5% and 15% per day per project, on a seven-day average, occasionally spiking higher depending on time and zone*. Based on this, changing zone (or time when the instances are running) may have an impact on preemption. – norbjd Oct 11 '19 at 18:54
  • By the way, preemptible instances are suitable for batch processing, when a VM failure can easily be recovered by starting a new instance that will replace the preempted one. For interactive processing (like notebooks as you mentioned), using preemptible VMs can cause you more trouble than non-preemptible VMs, as you have experienced. – norbjd Oct 11 '19 at 18:57

A comment on the "continuously shutting down" part: (I have experienced this as well)

Keep in mind that Google prefers to shut down RECENTLY STARTED preemptible instances, over ones started earlier.

The link below (and supplied earlier) has the statement:

Generally, Compute Engine avoids preempting too many instances from a single customer and preempts new instances over older instances whenever possible.

So yes, if you are preempted and boot up again, it is quite likely that you will be preempted again and again until the load in the zone drops.

I'm surprised that Google doesn't simply prevent you from starting the preemptible VM for a while (like 30-60 minutes?). - How much CPU is being wasted bouncing VMs up and down and crossing our fingers???

P.S. There is a dirty trick to get around your frustration - Have 2 VMs identically configured, except for preemptibility, but only 1 underlying boot disk. If you are having a bad day with preempts, simply 'move' the boot disk to the non-preemptible VM, boot it, and carry on. - It's a couple of simple gcloud commands to achieve this, easily scripted and very fast. Don't tell Google I told ya....
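The answer does not give the commands, so the following is only a sketch of how the disk move might be scripted. The zone, disk name, and VM names are hypothetical placeholders, and it assumes the target VM already exists without a boot disk of its own. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical defaults; replace with your own zone and disk name.
ZONE="${ZONE:-us-central1-a}"
DISK="${DISK:-shared-boot-disk}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Move the shared boot disk from one VM to another, then start the target.
# The source VM must be stopped before its boot disk can be detached;
# a VM that was just preempted is already stopped.
move_boot_disk() {
  from="$1"
  to="$2"
  run gcloud compute instances stop "$from" --zone "$ZONE"
  run gcloud compute instances detach-disk "$from" --disk "$DISK" --zone "$ZONE"
  run gcloud compute instances attach-disk "$to" --disk "$DISK" --boot --zone "$ZONE"
  run gcloud compute instances start "$to" --zone "$ZONE"
}

# Example: DRY_RUN=1 move_boot_disk preemptible-vm standard-vm
```

Calling the function with the arguments swapped moves the disk back to the preemptible VM once the zone calms down.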

https://cloud.google.com/compute/docs/instances/preemptible#limitations

spechter