Environment: Kubernetes with Istio sidecars injected.
I'm using bitnami/postgresql-ha as the database for Airflow, and I randomly see the log below in my PostgreSQL StatefulSet, which has 3 pods (image: bitnami/postgresql-repmgr:15.3.0-debian-11-r8). Sometimes it appears 10+ times a day, sometimes only once a day; I can't find any pattern.
[2023-08-18 02:41:42] [WARNING] unable to ping "user=repmgr password=admin host=airflow-postgresql-1.airflow-postgresql-headless.workflow.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-08-18 02:41:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-18 02:41:42] [WARNING] unable to connect to upstream node "airflow-postgresql-1" (ID: 1001)
[2023-08-18 02:41:42] [NOTICE] node "airflow-postgresql-1" (ID: 1001) has recovered, reconnecting
[2023-08-18 02:41:42] [NOTICE] reconnected to upstream node after 0 seconds
Note: it always reconnects after 0 seconds.
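To look for a pattern in when the warning appears, the repmgr log can be bucketed by hour. This is only a minimal sketch: `count_ping_failures_by_hour` is a hypothetical helper, and the regular expression assumes every line follows the `[timestamp] [LEVEL] message` layout shown in the excerpt above.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, based on the excerpt above:
# [2023-08-18 02:41:42] [WARNING] unable to ping "..."
LINE_RE = re.compile(
    r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$'
)


def count_ping_failures_by_hour(lines):
    """Count 'unable to ping' warnings per date-and-hour bucket."""
    buckets = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, level, message = match.groups()
        if level == "WARNING" and message.startswith("unable to ping"):
            when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            buckets[when.strftime("%Y-%m-%d %H:00")] += 1
    return buckets
```

The function accepts any iterable of lines, for example an open file holding the output of `kubectl logs` for one of the PostgreSQL pods. Comparing the busiest buckets with scheduled jobs or sidecar restarts might show whether the warnings cluster around a particular event.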
This can make the pgpool livenessProbe fail with the event message below, which in turn causes Airflow tasks to fail.
Liveness probe failed: Checking pgpool health...
psql: error: connection to server on socket "/opt/bitnami/pgpool/tmp/.s.PGSQL.5432"
failed: ERROR: unable to read message kind DETAIL: kind does not match between main(0) slot[0] (52)
I've tried:
- Extend the livenessProbe periodSeconds and timeoutSeconds for pgpool, but it doesn't help.
- Change pgpool replica count from 2 to 1 pod, but it doesn't help.
- Set pgHbaTrustAll to true in postgresql, but it doesn't help.
- Change postgresql and pgpool image version (tried pgpool 4.3 and 4.4, repmgr 14 and 15), but it doesn't help.
- Deploy the same architecture on another k8s cluster, and it still happens.
- Turn off pgpool load balancing, but it doesn't help.
- Increase the max connection size to 10000, but it doesn't help.
I've checked that the resources (CPU/memory) of all related pods are sufficient.
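Since `PQPING_NO_RESPONSE` in the repmgr log above means the server could not be reached, one further check could be a plain TCP probe against the same host and port, run from a pod in the same namespace. This is a rough sketch, not part of the chart: `tcp_probe` is a hypothetical helper, and the 5-second default simply mirrors `connect_timeout=5` from the log. With Istio sidecars the connection may be accepted by the local proxy first, so a success here may only show that the sidecar is reachable.

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Attempt one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects in one call
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # covers DNS failures, refused connections, and timeouts
        return False, time.monotonic() - start, str(exc)
```

Calling it in a loop with a short `time.sleep` between attempts and printing a timestamp for every failure would give a second log to line up against the repmgr warnings.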