Why does my 32 core GCE instance keep stopping and how can I debug it?

Question

I am experimenting with Google Compute Engine (GCE) to run some scientific/engineering software. I have successfully tested the system on GCE using an 8-core, non-preemptible instance running over a long time period. I'm now testing with a 32-core preemptible instance, but I'm finding that the instance stops running after a relatively short time (certainly less than an hour). Although it's preemptible, I was under the impression from the docs that it was relatively unlikely to be stopped under typical circumstances.

I would like to know if there is some way to determine why the instance was stopped (I don't see any kind of log, at least in the web interface), get suggestions for causes of this, and suggestions for remedies or ways to prevent it.

In case it is relevant I'm in the trial period of GCE using free credit. By default you are only supposed to have a maximum of 24 cores, but I requested a quota increase to 32 cores so I could test my system on this instance type.

I am going to attempt a run with a non-preemptible instance to see if this makes any difference. I'll update this question with an edit later to report the results of this.

It would be helpful if whoever downvoted my question gave a reason so I could improve future questions. – crobar Feb 06 '16 at 12:39
Welcome to Server Fault! By philosophy and design, votes are anonymous and **neither voting [up](http://serverfault.com/help/privileges/vote-up) nor voting [down](http://serverfault.com/help/privileges/vote-down) requires any mandatory explanation**. The tooltip that appears when your mouse pointer hovers over the down button states *"this question does not show any research effort; it is unclear or not useful"*. Questions can also attract a down vote when they are not [well written](http://meta.serverfault.com/a/3609/37681), not quite [on-topic](http://serverfault.com/help/on-topic), or missing details. – HBruijn Feb 07 '16 at 18:59

score 2 · Accepted Answer · answered Feb 06 '16 at 04:50

Preemptible VMs are subject to the availability of excess capacity in Google's data centers. Some regions and zones are more popular than others (for example, us-central1 is more popular than asia-east1) and are therefore less likely to have that excess capacity over longer periods of time.

If you can use another region/zone for your instance, try to experiment with other regions and zones and empirically check if they have more preemptible instances available.

Keep in mind that preemptible instances should only be used for stateless applications, otherwise your data or service will be lost.
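As a rough way to tell a preemption apart from some other kind of stop, a process on the instance can ask the GCE metadata server whether the VM has been preempted. The sketch below is only an illustration, not part of the original answer: the helper names `fetch_preempted_flag` and `is_preempted` are made up for this example, and the `instance/preempted` metadata path with the `Metadata-Flavor: Google` header is assumed to be reachable, which is only the case from inside a GCE VM.

```python
import urllib.request

# Metadata path that reports "TRUE" once the instance has been preempted.
# Assumption: this is only resolvable from inside a GCE VM.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted_flag(url=PREEMPTED_URL, timeout=2.0):
    """Return the raw text of the metadata server's 'preempted' value."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_preempted(fetch=fetch_preempted_flag):
    """True if the fetched value says this instance was preempted.

    `fetch` is injectable so the parsing can be exercised off-GCE.
    """
    return fetch().strip().upper() == "TRUE"
```

Calling `is_preempted()` from a shutdown hook or a periodic check and writing the result to a log would give at least some record of why the instance went away.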

DoiT International

Thanks, I will investigate alternative regions. In my case I can save the 'state' regularly in a small file and pick up from that state, but I'd like to get at least a few hours between instances stopping as restart time is non-negligible. – crobar Feb 06 '16 at 12:22
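The save-and-resume approach described in this comment could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the commenter's actual code: the function names, the JSON state layout, and the placeholder "work" in `run` are all hypothetical, and the checkpoint path is assumed to live on storage that survives the instance being stopped.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Write state as JSON to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # The rename replaces any older checkpoint in a single step, so a
        # stop in the middle of a save never leaves a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def run(total_steps, path, checkpoint_every=10):
    """Hypothetical work loop that resumes from the last checkpoint."""
    state = load_state(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)
    return state
```

With this shape, restarting the job after a preemption simply calls `run` again with the same path, and at most `checkpoint_every` steps of work are repeated.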
@crobar You can use [Instance Groups](https://cloud.google.com/compute/docs/instance-groups/) to automatically replace expiring preemptible instances with new ones, so that you don't have to control each instance in your project individually. You can even specify a minimum number of instances in your group, and the Instance Group Manager will make sure you have that many at all times. If your app allows it, consider running two 16-core preemptible instances instead of a single 32-core one to maintain better availability of your process. – DoiT International Feb 06 '16 at 12:33
I will look into this as well. In my case I am running an optimisation program that uses [HTCondor](https://research.cs.wisc.edu/htcondor/), operating as a 'personal' installation on the instance, to control multiple jobs. Ideally I could instead have multiple instances, each being part of an HTCondor cluster. At first, though, it would be easier just to restart a stopped instance. – crobar Feb 06 '16 at 12:46
