
I've been running a long job on GCE with a GPU. It is not a preemptible instance.

I was monitoring the job from a local terminal over SSH, with tmux running on the instance so the job keeps running if the SSH connection breaks. The screen froze, so I tried to SSH in from another terminal window, but that session froze too.

I went to the Google Cloud console to see what was going on, and the instance's monitoring graph shows a lot of disk reads.

I'm pretty sure that nothing I've done has caused the disk reads.

Any idea what is going on? I hope my job is still running and I don't want to start over again so I'd rather not stop and restart my instance.

new name
  • Have you tried connecting to your VM instance using gcloud: `gcloud compute ssh --project [PROJECT_ID] --zone [ZONE] [INSTANCE_NAME]`? More info about this [command](https://cloud.google.com/compute/docs/instances/connecting-to-instance#gcetools). If SSH is unavailable, you can try accessing the VM instance through the [Serial Console](https://cloud.google.com/compute/docs/instances/interacting-with-serial-console). You can also review the [Serial Console Output](https://cloud.google.com/compute/docs/instances/viewing-serial-port-output#viewing_serial_port_output) for more details an – Victor_Torres Jan 23 '20 at 00:38
  • I'll bet it's swapping itself into catatonia. – womble Jan 23 '20 at 02:30

1 Answer


I think Womble is right that it is a memory and swap issue.

When the instance was working, I SSH'ed in and ran a small, quick job, and I think that pushed the memory requirements over the edge. The instance stayed unresponsive for hours, so I stopped and restarted it.

When I started the job over from scratch the problem happened again. The job worked previously, so I'm going to wipe this instance out completely and create a new one from scratch and hope that it works again.

I can't increase memory because I'm already using the max.
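
For anyone who wants to check for the same condition before the instance locks up, here is a rough sketch that reads `/proc/meminfo` on a Linux guest and reports how much memory is still available and how much swap is in use. The function names and the 5% threshold are my own assumptions for illustration, not anything GCE provides, and it only helps while you can still get a shell.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB; fields without a unit are plain counts.
    """
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_pressure(info, min_available_frac=0.05):
    """Return (available_fraction, swap_used_kb, under_pressure).

    min_available_frac is an arbitrary illustrative threshold.
    """
    total = info["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    available_frac = available / total
    return available_frac, swap_used, available_frac < min_available_frac


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        frac, swap_kb, pressure = memory_pressure(parse_meminfo(f.read()))
    print(f"available: {frac:.1%}, swap used: {swap_kb} kB, low memory: {pressure}")
```

Running this periodically in a spare tmux window while the job runs should show whether available memory is trending toward zero before starting anything else on the instance.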

new name