On GCE, sudden disk I/O and can no longer SSH in

Question

I've been running a long job on GCE with a GPU. It is not a preemptible instance.

I was monitoring the job from a local terminal over SSH, with tmux running on the instance so the job keeps running if the SSH connection breaks. The screen froze, so I tried to SSH in from another terminal window, but that froze too.

I went to the Google Cloud console to see what was going on, and it shows a lot of disk reads.

I'm pretty sure that nothing I've done has caused the disk reads.

Any idea what is going on? I hope my job is still running, and I don't want to start over, so I'd rather not stop and restart my instance.

Have you tried connecting to your VM instance using gcloud: `gcloud compute ssh --project [PROJECT_ID] --zone [ZONE] [INSTANCE_NAME]`? More info about this [command](https://cloud.google.com/compute/docs/instances/connecting-to-instance#gcetools). If SSH is unavailable, you can try accessing the VM instance through the [Serial Console](https://cloud.google.com/compute/docs/instances/interacting-with-serial-console). Also, you can review the [Serial Console Output](https://cloud.google.com/compute/docs/instances/viewing-serial-port-output#viewing_serial_port_output) for more details an — Victor_Torres, Jan 23 '20 at 00:38

score 0 · Accepted Answer · answered Jan 23 '20 at 12:46

I think Womble is right that it is a memory and swap issue.

When the instance was working, I SSH'ed in and ran a small quick job, and I think that pushed the memory requirements over the edge. This condition lasted for hours so I stopped and restarted the instance.

When I started the job over from scratch the problem happened again. The job worked previously, so I'm going to wipe this instance out completely and create a new one from scratch and hope that it works again.

I can't increase memory because I'm already using the max.
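
For anyone who wants to confirm a memory and swap problem while the instance is still responsive, here is a minimal sketch, assuming a Linux guest that exposes `/proc/meminfo`. The function name `meminfo_summary` is made up for illustration; it only reads the standard `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` fields and prints how much swap is in use.

```shell
# Sketch only: summarize memory and swap usage from /proc/meminfo (Linux).
# meminfo_summary is a hypothetical helper name, not part of any tool.
# The optional first argument is a path to a saved copy of /proc/meminfo.
meminfo_summary() {
    awk '
        /^MemTotal:/     { mem_total = $2 }   # values are reported in kB
        /^MemAvailable:/ { mem_avail = $2 }
        /^SwapTotal:/    { swap_total = $2 }
        /^SwapFree:/     { swap_free = $2 }
        END {
            printf "mem_available_kb=%d mem_total_kb=%d swap_used_kb=%d swap_total_kb=%d\n", mem_avail, mem_total, swap_total - swap_free, swap_total
        }
    ' "${1:-/proc/meminfo}"
}
```

Running `meminfo_summary` with no argument reads the live `/proc/meminfo`. If `mem_available_kb` is close to zero and `swap_used_kb` keeps growing, the machine is probably thrashing, which would be consistent with heavy disk reads and a frozen SSH session.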
