GCP App Engine Flexible Environment Slow Disk I/O

Question

Background/Overview:

We have a Python web application with three environments (dev, staging and production) spread across two projects: dev and staging share one project, and production has its own. These are hosted in GCP using Google App Engine Flexible Environment, Cloud SQL and a GCP bucket for static files. These environments function well and are very responsive. We decided to split dev and staging into two separate projects so we can tune the staging environment to be closer to production.

The Problem:

We set up the dev project in the same fashion as our staging environment. However, the instance becomes unresponsive after a few requests, and the Gunicorn workers eventually time out. Investigating with Stackdriver and top, we noticed that the App Engine instance spends most of its time waiting for I/O: top continually reports a CPU wait time (wa) of around 90%.

When we attempt to SSH into the instance, it takes a long time to connect, sometimes 5 or more minutes, and other times it never connects at all. Once we get in, the shell is very laggy. Docker commands such as docker container ls can take several minutes to return a list of containers; on occasion they never returned and the instance restarted. top, on the other hand, works well.

When I manage to get into the Docker container running our app, the experience is much the same, but it's difficult to track down which I/O is causing issues because the shell quickly becomes unresponsive. If we let the application sit idle for about half an hour, we can shell into the instance consistently, but upon shelling into the application's Docker container (docker exec -it [id] /bin/bash) things quickly become unresponsive.
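For anyone wanting to log the wa figure over time rather than watch top, here is a minimal Python sketch that derives the same percentage from two samples of /proc/stat. The function names are made up for illustration, and it assumes a Linux host where /proc/stat is readable; it is not what we actually ran.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal guest ...
    # Only the first eight are summed, because guest time is already
    # included in user/nice and would otherwise be counted twice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return iowait %."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return iowait_percent(first, second)
```

Calling sample_iowait() in a loop and printing the result should roughly track the wa column in top, since both are computed from the same kernel counters.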

What's Observed

Memory usage ~800 MB with ~200 MB free. No virtual memory used.
CPU usage usually only in the single digits, mostly system (not counting waiting for I/O).
Database CPU and RAM usage extremely low. The database also uses an SSD.
The Google Cloud Bucket is correctly configured and populated with static files upon deployment.
Attempting to delete instances and have them rebuild does not resolve the issue.
Pushing new versions does not resolve the issue.
Increasing the timeout of the Gunicorn workers helps keep the application from returning errors but eventually the requests begin to time out or the instance will restart.
We've verified that the environment variables have been correctly updated to reflect the new GCP project name, bucket and database.
The two other environments running the same code function well.
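To make the Gunicorn timeout change mentioned above concrete, here is a sketch of a Gunicorn config file. The file name and every value are illustrative assumptions, not our actual settings. The worker_tmp_dir line is a real Gunicorn setting that its documentation suggests for Docker, where the worker heartbeat file can otherwise land on a disk-backed filesystem and block; whether it would have helped in this case is untested.

```python
# gunicorn.conf.py (hypothetical file; all values are illustrative)
import multiprocessing

# App Engine Flexible routes requests to port 8080 inside the container.
bind = "0.0.0.0:8080"

# Common starting point for worker count; tune for the instance size.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before it is killed and restarted.
# Gunicorn's default is 30; raising it only hides slow I/O, it does not fix it.
timeout = 120

# Seconds workers get to finish in-flight requests on restart.
graceful_timeout = 30

# Keep the worker heartbeat file on a memory-backed filesystem so that
# heartbeats do not depend on disk I/O.
worker_tmp_dir = "/dev/shm"
```

It would be loaded with gunicorn -c gunicorn.conf.py followed by the application module, which is not shown here.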

Thoughts/TL;DR

We have something off with the configuration of our new Dev Environment/Project that appears to be causing slow I/O speeds. Previously this configuration ran fine in parallel with our staging environment in a different project. Our ducks appear to be in a row from everything we see and can think of but we're obviously missing something. Any assistance and ideas would be greatly appreciated.

What IOs are happening that leaves so many processes in iowait? Modern Linux can trace this in detail, see for example the biosnoop eBPF script http://manpages.ubuntu.com/manpages/bionic/man8/biosnoop-bpfcc.8.html — John Mahowald, Jul 05 '19 at 18:48
Thanks for your reply John. Unfortunately the instance becomes unresponsive very quickly once I shell into the docker container so I'm unable to do any proper command line debugging. I've managed to get top running but if I attempt to do pretty much anything else the shell freezes. The strange thing is that the app can be idle with no requests but the moment I shell into the docker container it will quickly become unresponsive. — Bradley, Jul 10 '19 at 17:38
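A lighter-weight alternative to eBPF tracing, sketched under assumptions: since top stays usable on the instance, a script that only reads /proc (which is memory-backed) may also survive. The Python sketch below lists processes in uninterruptible sleep (state D), which are the ones counted as waiting on I/O. The function names are made up for illustration, and the Python interpreter itself still has to load from disk, so this may hang on a badly stalled instance too.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, command, state) from the content of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    command = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited mid-scan or entry was unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Running blocked_processes() a few times in a row and seeing the same commands reappear would point at what is stuck on I/O, even though it does not show which device or file is involved.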

Bradley · Answer 1 · 2019-07-25T17:40:50.773

The solution was to create a new project. We set it up in the same way, except using App Engine Standard instead of Flex. We then just updated the environment variables to use the new project-id, bucket-id and service account key, and everything worked.

Either we had a configuration option wrong somewhere in our old project or something odd happened on GCP's end. I still find it highly strange that a new instance with no traffic (and therefore with none of our code executing) would die just from running apt-get after shelling into the container.
