Problem
Under high load, our Cloud SQL Proxy occasionally logs this error:
2020/06/05 13:35:47 couldn't connect to "my-cloudsql-instance": dial tcp xx.xx.xx.xx:3307: connect: connection timed out
Context
We run a Kubernetes cluster with an Airflow pod that starts many tasks in parallel using the LocalExecutor. Each new task connects to the Airflow metadata database (which runs in Cloud SQL) through the Cloud SQL Proxy, which runs as a sidecar in the Airflow pod. Every once in a while the error above occurs, which causes the task to fail in Airflow.
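To make the connection pattern concrete, here is a rough sketch of a burst of parallel connection attempts like the one the LocalExecutor tasks generate. It is not what Airflow itself does: it only opens plain TCP connections (no PostgreSQL handshake) to whatever address the proxy listens on, and the function names, the address 127.0.0.1:5432, and the attempt count are assumptions for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def try_connect(host, port, timeout):
    """Open and immediately close one TCP connection.

    Returns None on success, or a short error description on failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # covers timeouts and refused connections
        return f"{type(exc).__name__}: {exc}"


def connection_burst(host, port, attempts=50, timeout=5.0):
    """Fire `attempts` connection attempts in parallel and summarize the results."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        results = list(
            pool.map(lambda _: try_connect(host, port, timeout), range(attempts))
        )
    errors = [r for r in results if r is not None]
    return {"attempts": attempts, "ok": attempts - len(errors), "errors": errors}


# Hypothetical usage, assuming the proxy sidecar listens on 127.0.0.1:5432:
#   summary = connection_burst("127.0.0.1", 5432, attempts=50)
#   print(summary["ok"], "of", summary["attempts"], "connections succeeded")
```

Because this only exercises the local listener, any failure on the proxy's outbound side would still have to be confirmed in the proxy log.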
What I've tested and found out so far:
- Under low load this never happens
- No errors or warnings are visible in the Cloud SQL instance logs
- Judging by their CPU and memory usage at the time of the error, the Airflow container, the Cloud SQL Proxy container and the Cloud SQL instance all have enough resources to handle this load
- The maximum number of connections in Cloud SQL is not reached (peak of about 40 out of 100)
- It seems to happen after some tasks have run to completion and new ones are starting up
- On the Airflow side, this is visible in the log:
[2020-06-04 11:11:13,839] {taskinstance.py:1128} ERROR -
(psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
(Background on this error at: http://sqlalche.me/e/e3q8)
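For reference, the connection count mentioned in the list above can be read from the database itself. The sketch below assumes a PostgreSQL metadata database (the psycopg2 error suggests this) and an already open DB-API cursor; the helper name `connection_usage` is made up for illustration. It counts every row in `pg_stat_activity`, which on newer PostgreSQL versions also includes background processes, so it may slightly overstate client connections.

```python
def connection_usage(cursor):
    """Return (connections_in_use, max_connections) using a DB-API cursor.

    Assumes a PostgreSQL server; `cursor` is any open DB-API cursor,
    for example one obtained from psycopg2.
    """
    # One row per server process, including this session.
    cursor.execute("SELECT count(*) FROM pg_stat_activity")
    in_use = int(cursor.fetchone()[0])

    # SHOW returns the setting as text, so convert it explicitly.
    cursor.execute("SHOW max_connections")
    limit = int(cursor.fetchone()[0])

    return in_use, limit


# Hypothetical usage with psycopg2 (connection parameters are placeholders):
#   import psycopg2
#   with psycopg2.connect(host="127.0.0.1", dbname="airflow", user="airflow") as conn:
#       with conn.cursor() as cur:
#           print(connection_usage(cur))
```

Sampling this periodically during a high-load run would show whether the count approaches the limit shortly before the error appears.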