
I am trying to install the PGO operator by following the docs. When I run this command:

kubectl apply --server-side -k kustomize/install/default

my pod starts, but it soon goes into CrashLoopBackOff.

What I have done: I checked the pod's logs with this command:

kubectl logs pgo-98c6b8888-fz8zj -n postgres-operator

Result

time="2023-01-09T07:50:56Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T07:51:26Z" level=error msg="Failed to get API Group-Resources" error="Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout" version=5.3.0-0
panic: Get "https://10.96.0.1:443/api?timeout=32s": dial tcp 10.96.0.1:443: i/o timeout

goroutine 1 [running]:
main.assertNoError(...)
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:84 +0x465

To check the network connection to the host, I ran this command:

wget https://10.96.0.1:443/api

The result is:

--2023-01-09 09:49:30--  https://10.96.0.1/api
Connecting to 10.96.0.1:443... connected.
ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
  Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.

As you can see, it connects to the API.
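
For reference, the same check can also be run from inside a pod, using the mounted service account CA and token to avoid the certificate warning (a rough sketch; which pod to exec into is up to you, ideally one on the same node as the PGO pod):

# Run from a shell inside a pod on the affected node.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
wget -qO- \
  --ca-certificate=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --header="Authorization: Bearer ${TOKEN}" \
  https://10.96.0.1:443/api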

A strange observation that might be relevant:

I ran kubectl get pods --all-namespaces and saw this output:

NAMESPACE           NAME                                                READY   STATUS             RESTARTS         AGE
kube-flannel        kube-flannel-ds-9gmmq                               1/1     Running            0                3d16h
kube-flannel        kube-flannel-ds-rcq8l                               0/1     CrashLoopBackOff   10 (3m15s ago)   34m
kube-flannel        kube-flannel-ds-rqwtj                               0/1     CrashLoopBackOff   10 (2m53s ago)   34m
kube-system         etcd-masterk8s-virtual-machine                      1/1     Running            1 (5d ago)       3d17h
kube-system         kube-apiserver-masterk8s-virtual-machine            1/1     Running            2 (5d ago)       3d17h
kube-system         kube-controller-manager-masterk8s-virtual-machine   1/1     Running            8 (2d ago)       3d17h
kube-system         kube-scheduler-masterk8s-virtual-machine            1/1     Running            7 (5d ago)       3d17h
postgres-operator   pgo-98c6b8888-fz8zj                                 0/1     CrashLoopBackOff   7 (4m59s ago)    20m

As you can see, two of my kube-flannel pods are also in CrashLoopBackOff and only one is running. I am not sure whether this is the main cause of the problem.
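
In case they are relevant, I can also pull the flannel pods' own logs and events with commands like these (pod names from the output above):

kubectl logs -n kube-flannel kube-flannel-ds-rcq8l
kubectl describe pod -n kube-flannel kube-flannel-ds-rcq8l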

What I want: I want the PGO pod to run successfully with no errors.

How you can help me: please help me find the issue, or suggest another way to get more detailed logs. I cannot find the root cause: if this were a network issue, why does wget connect? And if it is something else, how can I get more information?

Update: new errors after applying the fixes:

time="2023-01-09T11:57:47Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Metrics server is starting to listen" addr=":8080" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="upgrade checking enabled" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="starting controller runtime manager and will wait for signal to exit" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting server" addr="[::]:8080" kind=metrics path=/metrics version=5.3.0-0
time="2023-01-09T11:57:47Z" level=debug msg="upgrade check issue: namespace not set" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1beta1.PostgresCluster" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ConfigMap" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Endpoints" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PersistentVolumeClaim" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Secret" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Service" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ServiceAccount" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Deployment" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Job" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Role" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.RoleBinding" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.CronJob" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PodDisruptionBudget" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Pod" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting Controller" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster version=5.3.0-0
W0109 11:57:48.006419       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:48.006642       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:57:49Z" level=info msg="{\"pgo_versions\":[{\"tag\":\"v5.1.0\"},{\"tag\":\"v5.0.5\"},{\"tag\":\"v5.0.4\"},{\"tag\":\"v5.0.3\"},{\"tag\":\"v5.0.2\"},{\"tag\":\"v5.0.1\"},{\"tag\":\"v5.0.0\"}]}" X-Crunchy-Client-Metadata="{\"deployment_id\":\"288f4766-8617-479b-837f-2ee59ce2049a\",\"kubernetes_env\":\"v1.26.0\",\"pgo_clusters_total\":0,\"pgo_version\":\"5.3.0-0\",\"is_open_shift\":false}" version=5.3.0-0
W0109 11:57:49.163062       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:49.163119       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:51.404639       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:51.404811       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:54.749751       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:54.750068       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:06.015650       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:06.015710       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:25.355009       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:25.355391       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:59:10.447123       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:59:10.447490       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:59:47Z" level=error msg="Could not wait for Cache to sync" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for non leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for caches" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="failed to get informer from cache" error="Timeout: failed waiting for *v1.PodDisruptionBudget Informer to sync" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="error received after stop sequence was engaged" error="context canceled" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for webhooks" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Wait completed, proceeding to shutdown the manager" version=5.3.0-0
panic: failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced

goroutine 1 [running]:
main.assertNoError(...)
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:118 +0x434
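
If it helps with diagnosis, I can check whether the new errors are a permissions (RBAC) problem rather than a network one, with something like this (service account name taken from the log lines above):

kubectl auth can-i list poddisruptionbudgets.policy \
  --as=system:serviceaccount:postgres-operator:pgo
kubectl get clusterrole,clusterrolebinding | grep -i postgres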

1 Answer

If this is a new deployment, I suggest using v5.

That said, since PGO manages the networking for Postgres clusters (and, as such, manages listen_addresses), there is no reason to modify the listen_addresses configuration parameter. If you need to manage networking or network access, you can do that by setting the pg_hba config or by using NetworkPolicies.

Please go through the GitHub issue Custom 'listen_addresses' not applied #2904 for more information.
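
As a rough sketch of the pg_hba route (field names assumed from the Patroni dynamic-configuration section of the PostgresCluster spec; check the PGO v5 docs for the exact layout):

spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        pg_hba:
          # Example rule only; adjust to your own networks and auth method.
          - "host all all 0.0.0.0/0 md5"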

CrashLoopBackOff: check the pod logs for configuration or deployment issues such as missing dependencies (for example, Kubernetes has no direct equivalent of docker-compose's depends_on ordering), and also check whether pods are being OOM-killed or are using excessive resources.
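
Two quick checks on the crashing pod (pod name taken from the question) are the previous container's logs and the pod's events/last state:

kubectl logs pgo-98c6b8888-fz8zj -n postgres-operator --previous
kubectl describe pod pgo-98c6b8888-fz8zj -n postgres-operator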

Also check for timeout issues; there is a lab on the timeout problem that may help.

ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
  Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.

Try this solution for the above error (commands sketched below):

First, remove the flannel.1 ip link on every host that has this problem.

Second, delete kube-flannel-ds from Kubernetes.

Last, recreate kube-flannel-ds; the flannel.1 ip link will be recreated and work again.

(For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init, i.e. make sure the pod CIDR matches flannel's default.)
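
A rough command-level sketch of those steps; the manifest URL below is the upstream flannel default and is an assumption, so use whatever you originally installed flannel from:

# On every affected host:
sudo ip link delete flannel.1

# Then delete and recreate the flannel DaemonSet:
kubectl delete -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml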

Edit:

Please check this similar issue and solution, which may help resolve your issue.

Veera Nagireddy
  • Hi @Veera Nagireddy, thanks for your answer. I reinstalled flannel and it still was not fixed. Then I reset kubeadm and rejoined all the nodes one by one. That fixed the flannel issue, and there is no more CrashLoopBackOff for flannel. However, PGO is still not fixed and shows a different error; I have just updated the question with the new errors. Regarding the version, I am using v5. – tauqeerahmad24 Jan 09 '23 at 12:00
  • Check this [Link](https://forums.percona.com/t/pod-crashlooping-with-failed-to-sync-secret-cache-timed-out-waiting-for-the-condition/13844); it may help resolve your issue. – Veera Nagireddy Jan 09 '23 at 12:42
  • Also check what your cluster version is and what your upgrade process looks like. The recommended process is: 1) upgrade the master node, 2) upgrade the worker nodes, 3) upgrade the add-ons (e.g. the CSI driver). – Veera Nagireddy Jan 09 '23 at 13:12
  • Thanks Veera, I will update you very soon once I have tried your recommendations. – tauqeerahmad24 Jan 12 '23 at 07:51