Environment: Kubernetes with Istio sidecars injected.

I'm using bitnami/postgresql-ha as the database for Airflow, and I randomly see the log below in my PostgreSQL StatefulSet with 3 pods (image: bitnami/postgresql-repmgr:15.3.0-debian-11-r8). Sometimes it appears 10+ times a day, sometimes only once a day, and I can't find any pattern.

[2023-08-18 02:41:42] [WARNING] unable to ping "user=repmgr password=admin host=airflow-postgresql-1.airflow-postgresql-headless.workflow.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-08-18 02:41:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-18 02:41:42] [WARNING] unable to connect to upstream node "airflow-postgresql-1" (ID: 1001)
[2023-08-18 02:41:42] [NOTICE] node "airflow-postgresql-1" (ID: 1001) has recovered, reconnecting
[2023-08-18 02:41:42] [NOTICE] reconnected to upstream node after 0 seconds

Note: it always reconnects after 0 seconds.
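
To look for a pattern in when the warnings occur, the log could be bucketed by hour. The following is only a minimal sketch, not something from my current setup: the function name `ping_failures_per_hour` is hypothetical, and it assumes every repmgr line has the `[timestamp] [LEVEL] message` shape shown above.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Assumed line shape, taken from the log excerpt above:
# [2023-08-18 02:41:42] [WARNING] unable to ping "..."
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def ping_failures_per_hour(lines):
    """Count 'unable to ping' warnings per date-and-hour bucket."""
    buckets = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a repmgr log line
        timestamp, level, message = match.groups()
        if level == "WARNING" and message.startswith("unable to ping"):
            when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            buckets[when.strftime("%Y-%m-%d %H:00")] += 1
    return buckets


if __name__ == "__main__":
    # Reads log lines from stdin and prints one "bucket count" pair per line.
    for bucket, count in sorted(ping_failures_per_hour(sys.stdin).items()):
        print(bucket, count)
```

The pod's logs (for example from `kubectl logs`) can be piped into the script on stdin. If the counts cluster around particular hours, that may point at a scheduled job or a sidecar restart rather than at PostgreSQL itself.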

This can cause the pgpool livenessProbe to fail with the event message below, which in turn causes Airflow tasks to fail.

Liveness probe failed: Checking pgpool health... 
psql: error: connection to server on socket "/opt/bitnami/pgpool/tmp/.s.PGSQL.5432" 
failed: ERROR: unable to read message kind DETAIL: kind does not match between main(0) slot[0] (52)

I've tried:

  1. Extended the livenessProbe periodSeconds and timeoutSeconds for pgpool, but it didn't help.
  2. Changed the pgpool replica count from 2 pods to 1, but it didn't help.
  3. Set pgHbaTrustAll to true in postgresql, but it didn't help.
  4. Changed the postgresql and pgpool image versions (tried pgpool 4.3 and 4.4, repmgr 14 and 15), but it didn't help.
  5. Deployed the same architecture on another k8s cluster, and it still happens.
  6. Turned off pgpool load balancing, but it didn't help.
  7. Increased the maximum number of connections to 10000, but it didn't help.

I've also checked that the resources (CPU/memory) of all related pods are sufficient.

Jasmine H