
I'm running a distributed Airflow setup using docker-compose. Most of the services run on one server, and the Celery workers run on several other servers. I have a few hundred tasks running every five minutes, and I started to run out of DB connections, which was indicated by this error message in the task logs:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "SERVER" (IP), port XXXXX failed: FATAL:  sorry, too many clients already

I'm using Postgres as the metastore, and max_connections is set to the default value of 100. I didn't want to raise max_connections because I thought there should be a better solution: at some point I'll be running thousands of tasks every 5 minutes, and the number of connections is guaranteed to run out again. So I added pgbouncer to my configuration.

Here's how I configured pgbouncer:

  pgbouncer:
    image: "bitnami/pgbouncer:1.16.0"
    restart: always
    environment:
      POSTGRESQL_HOST: "postgres"
      POSTGRESQL_USERNAME: ${POSTGRES_USER}
      POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRESQL_PORT: ${PSQL_PORT}
      PGBOUNCER_DATABASE: ${POSTGRES_DB}
      PGBOUNCER_AUTH_TYPE: "trust"
      PGBOUNCER_IGNORE_STARTUP_PARAMETERS: "extra_float_digits"
    ports:
      - '1234:1234'
    depends_on:
      - postgres
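The Bitnami image also seems to expose pool-tuning settings as environment variables. I haven't set any of them, so presumably the defaults apply; if I've read the image docs correctly, constraining the pool would look roughly like this (the values below are just guesses on my part, not something I've tested):

  pgbouncer:
    image: "bitnami/pgbouncer:1.16.0"
    environment:
      # share a small number of server connections between many clients
      PGBOUNCER_POOL_MODE: "transaction"
      # how many client connections pgbouncer itself will accept
      PGBOUNCER_MAX_CLIENT_CONN: "500"
      # server connections kept open towards Postgres per user/database pair
      PGBOUNCER_DEFAULT_POOL_SIZE: "20"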

The pgbouncer logs look like this:

pgbouncer 13:29:13.87 
pgbouncer 13:29:13.87 Welcome to the Bitnami pgbouncer container
pgbouncer 13:29:13.87 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-pgbouncer
pgbouncer 13:29:13.87 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-pgbouncer/issues
pgbouncer 13:29:13.88 
pgbouncer 13:29:13.89 INFO  ==> ** Starting PgBouncer setup **
pgbouncer 13:29:13.91 INFO  ==> Validating settings in PGBOUNCER_* env vars...
pgbouncer 13:29:13.91 WARN  ==> You set the environment variable PGBOUNCER_AUTH_TYPE=trust. For safety reasons, do not use this flag in a production environment.
pgbouncer 13:29:13.91 INFO  ==> Initializing PgBouncer...
pgbouncer 13:29:13.92 INFO  ==> Waiting for PostgreSQL backend to be accessible
pgbouncer 13:29:13.92 INFO  ==> Backend postgres:9876 accessible
pgbouncer 13:29:13.93 INFO  ==> Configuring credentials
pgbouncer 13:29:13.93 INFO  ==> Creating configuration file
pgbouncer 13:29:14.06 INFO  ==> Loading custom scripts...
pgbouncer 13:29:14.06 INFO  ==> ** PgBouncer setup finished! **

pgbouncer 13:29:14.08 INFO  ==> ** Starting PgBouncer **
2022-10-25 13:29:14.089 UTC [1] LOG kernel file descriptor limit: 1048576 (hard: 1048576); max_client_conn: 100, max expected fd use: 152
2022-10-25 13:29:14.089 UTC [1] LOG listening on 0.0.0.0:1234
2022-10-25 13:29:14.089 UTC [1] LOG listening on unix:/tmp/.s.PGSQL.1234
2022-10-25 13:29:14.089 UTC [1] LOG process up: PgBouncer 1.16.0, libevent 2.1.8-stable (epoll), adns: c-ares 1.14.0, tls: OpenSSL 1.1.1d  10 Sep 2019
2022-10-25 13:30:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:31:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:32:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:33:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:34:14.089 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:35:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:36:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:37:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:38:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:39:14.089 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us

The service seems to run OK, but I don't think it's doing anything. There was very little information about this in the Airflow documentation, and I'm unsure what to change:

  1. Should I change the pgbouncer setup in my docker-compose file?
  2. Should I change the AIRFLOW__DATABASE__SQL_ALCHEMY_CONN variable?

Update 1: I edited the docker-compose.yml for the worker nodes and changed the DB port to the pgbouncer port. After this I got some traffic in the bouncer logs. With this configuration Airflow tasks are queued but not processed, so there's still something wrong. I didn't edit the docker-compose YAML that launches the webserver, scheduler, etc.; I wasn't sure how.

   AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow
   AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow
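
I assume the same change would be needed in the main node's docker-compose.yml so that the webserver and scheduler also go through the bouncer. Based on the layout of the official compose example, it would presumably go into the shared environment block, something like the sketch below (the x-airflow-common anchor name comes from the official example and may not match my file):

   x-airflow-common:
     environment:
       # point the metadata DB and the Celery result backend at pgbouncer
       # instead of at Postgres directly
       AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow
       AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow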

pgbouncer log after the change:

2022-10-26 11:46:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:47:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:48:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:49:22.519 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:50:22.518 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:51:22.516 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:51:52.356 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:52.359 UTC [1] LOG S-0x5602cf8b1f20: <XXX>@<IP:PORT> new connection to server (from <IP:PORT>)
2022-10-26 11:51:52.410 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=0s)
2022-10-26 11:51:52.834 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:52.845 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=0s)
2022-10-26 11:51:56.752 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:57.393 UTC [1] LOG C-0x5602cf8ab3b0: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:57.394 UTC [1] LOG S-0x5602cf8b2150: <XXX>@<IP:PORT> new connection to server (from <IP:PORT>)
2022-10-26 11:51:59.906 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=3s)
2022-10-26 11:52:00.642 UTC [1] LOG C-0x5602cf8ab3b0: <XXX>@<IP:PORT> closing because: client close request (age=3s)

MrBronson
  • It seems like you have shown us the part you don't need help with, and have not shown us the part you do need help with. Presumably you need to change the host, not just the port. But you also say at least some of the connections are working, so..... – jjanes Oct 26 '22 at 15:25
  • 1
    pgbouncer can be great for when you have no control over the client. But pooling is best done inside the client when you do have control over it. – jjanes Oct 26 '22 at 15:37
  • You are probably right. I realized now that the default pool size in Airflow is 128 and max_connections in Postgres was set to 100. So if I understood correctly, this creates an issue by default. I've now increased max_connections to 250. – MrBronson Oct 27 '22 at 08:31
  • It also seems that the number of active connections never goes below 50, so I need to find some way to check whether all of these connections are necessary or whether some of them are not being closed properly. – MrBronson Oct 27 '22 at 08:33

0 Answers