
Problem

Under high load, our Cloud SQL Proxy occasionally hits this:

2020/06/05 13:35:47 couldn't connect to "my-cloudsql-instance": dial tcp xx.xx.xx.xx:3307: connect: connection timed out

Context

Kubernetes cluster with an Airflow pod that starts a lot of tasks in parallel with the LocalExecutor. Each of these new tasks will connect to the Airflow metadata database (which runs in Cloud SQL) through the Cloud SQL Proxy (sidecar of the Airflow pod). Every once in a while the error above happens, which causes the task to fail in Airflow.

What I've tested and found out so far:

  • Under low load this never happens
  • No errors or warnings are visible in the Cloud SQL instance logs
  • The Airflow container, Cloud SQL Proxy container and Cloud SQL instance all have enough resources to handle this load, looking at their CPU and memory usage at the time of the error
  • The max number of connections in Cloud SQL is not reached (max about 40 out of 100)
  • It seems to happen after some tasks have run and completed and new ones are starting up
  • On the Airflow side, this is visible in the log:
[2020-06-04 11:11:13,839] {taskinstance.py:1128} ERROR -
    (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

(Background on this error at: http://sqlalche.me/e/e3q8)
Erik Mulder
  • How are you connecting to the Cloud SQL instance? Are you using connection pooling? The Cloud SQL dashboard shows the number of active connections; could you check whether you are getting close to the maximum connections allowed on the instance? – Soni Sol Jun 05 '20 at 20:55

1 Answer


This issue is probably caused by our custom network infra setup and not by any Google tools or services. We did find an interesting solution though: add another proxy! Whut, why? It turns out Airflow does not do proper connection pooling, so connections were constantly being opened and closed, each time incurring the SSL handshake and the authentication / authorization overhead that comes with using the Cloud SQL Proxy container. Hence the high load and the occasional dropped connection.

We added a PgBouncer container to the pod Airflow was running in and used the proper connection pooling implemented there. All the connection opening and closing now happens over the local network inside the pod, without SSL or complicated authentication, so it is super fast. No more high load, no more connection dropping!
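
To illustrate why this helps, here is a toy sketch in Python. It is not how PgBouncer is implemented; `CountingBackend` and `TinyPool` are hypothetical names made up for this example. It only counts how often the expensive upstream hop (proxy, SSL handshake, authentication) would be paid when every task opens its own connection versus when a pool hands out already-open ones.

```python
import queue


class CountingBackend:
    """Hypothetical stand-in for the expensive upstream hop (proxy + SSL + auth)."""

    def __init__(self):
        self.handshakes = 0

    def connect(self):
        # In the real setup this is where the slow handshake would happen.
        self.handshakes += 1
        return object()


class TinyPool:
    """Hypothetical minimal pool: reuse idle connections, open new ones only if none are idle."""

    def __init__(self, backend, size):
        self._backend = backend
        self._idle = queue.LifoQueue(maxsize=size)

    def acquire(self):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return self._backend.connect()

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            pass  # pool already holds `size` idle connections; drop this one


def run_tasks_without_pool(backend, n_tasks):
    # Each task opens its own upstream connection and throws it away afterwards.
    for _ in range(n_tasks):
        backend.connect()


def run_tasks_with_pool(backend, n_tasks, pool_size):
    # Each task borrows a connection from the pool and returns it when done.
    pool = TinyPool(backend, pool_size)
    for _ in range(n_tasks):
        conn = pool.acquire()
        pool.release(conn)


direct = CountingBackend()
run_tasks_without_pool(direct, 200)

pooled = CountingBackend()
run_tasks_with_pool(pooled, 200, pool_size=5)

print("upstream handshakes without pool:", direct.handshakes)
print("upstream handshakes with pool:   ", pooled.handshakes)
```

The tasks here run one after another, so the pooled variant needs only a single upstream connection; with real parallel tasks the number would grow toward the pool size, but it would still stay far below one handshake per task.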

Erik Mulder
  • Can you please explain your steps? thanks! – dasdasd Apr 07 '22 at 19:25
  • @dasdasd 1. Run PgBouncer (https://www.pgbouncer.org/) alongside Airflow (as a separate container in the case of Kubernetes). 2. Configure the DB connection in Airflow to point to the local PgBouncer instance. 3. Configure PgBouncer to connect to your 'real' Airflow DB (in our case Google Cloud SQL through the Cloud SQL Proxy). 4. Airflow will keep opening and closing connections to PgBouncer, but PgBouncer will maintain a pool of open connections to the remote DB. For us this setup made the connectivity issues under high load go away. – Erik Mulder Apr 11 '22 at 12:00
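
The steps in the comment above could look roughly like the following sketch, which generates a minimal `pgbouncer.ini` and the matching Airflow connection URI from Python. All concrete values are assumptions for illustration and not taken from the original setup: the Cloud SQL Proxy listening on 127.0.0.1:5432, PgBouncer listening on 6432, a database named `airflow`, the `auth_file` path, and the pool sizes. The helper names `build_pgbouncer_ini` and `airflow_conn_uri` are hypothetical.

```python
import configparser
import io

# Assumed values, for illustration only.
PROXY_HOST = "127.0.0.1"  # Cloud SQL Proxy sidecar, reachable inside the pod
PROXY_PORT = 5432         # assumed port the Cloud SQL Proxy listens on
BOUNCER_PORT = 6432       # port PgBouncer conventionally listens on
DB_NAME = "airflow"       # assumed name of the Airflow metadata database


def build_pgbouncer_ini():
    """Return the text of a minimal pgbouncer.ini (step 3: PgBouncer -> proxy)."""
    cfg = configparser.ConfigParser(interpolation=None)
    # [databases]: where PgBouncer forwards connections for this database name.
    cfg["databases"] = {
        DB_NAME: f"host={PROXY_HOST} port={PROXY_PORT} dbname={DB_NAME}",
    }
    # [pgbouncer]: listener and pool settings; sizes are assumptions and should
    # stay below the connection limit of the Cloud SQL instance.
    cfg["pgbouncer"] = {
        "listen_addr": "127.0.0.1",
        "listen_port": str(BOUNCER_PORT),
        "auth_type": "md5",
        "auth_file": "/etc/pgbouncer/userlist.txt",  # assumed path
        "pool_mode": "session",
        "max_client_conn": "200",
        "default_pool_size": "20",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


def airflow_conn_uri(user, password):
    """Return the SQLAlchemy URI Airflow would use (step 2: Airflow -> PgBouncer)."""
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@127.0.0.1:{BOUNCER_PORT}/{DB_NAME}"
    )


ini_text = build_pgbouncer_ini()
uri = airflow_conn_uri("airflow_user", "change-me")  # placeholder credentials
print(ini_text)
print(uri)
```

The resulting URI would go into Airflow's `sql_alchemy_conn` setting, so Airflow only ever talks to PgBouncer on localhost. `pool_mode = session` is the conservative choice here; whether a more aggressive mode is safe depends on how the application uses its sessions.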