3

I'm deploying an application using ArgoCD. The deployment manifests include a Job that performs some one-time initialization for the application. The Job resource looks like this:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app.kubernetes.io/instance: house
    app.kubernetes.io/name: step-certificates
  name: create-acme-provisioner
  namespace: step-certificates
spec:
  backoffLimit: 100
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: house
        app.kubernetes.io/name: step-certificates
    spec:
      containers:
      - command:
        - /bin/bash
        - -c
        - |
          while ! step ca health; do
            echo "waiting for ca"
            sleep 1
          done

          if ! step ca provisioner list | grep -q '"name": "acme"'; then
            step ca provisioner add acme --type ACME \
              --admin-subject step \
              --password-file /home/step/secrets/passwords/password \
              --admin-provisioner "Admin JWK"
          fi
        image: cr.step.sm/smallstep/step-ca:0.22.1
        name: create-acme-provisioner
        volumeMounts:
        - mountPath: /home/step/certs
          name: certs
          readOnly: true
        - mountPath: /home/step/config
          name: config
          readOnly: true
        - mountPath: /home/step/secrets
          name: secrets
          readOnly: true
        - mountPath: /home/step/secrets/passwords
          name: ca-password
          readOnly: true
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
      - configMap:
          name: step-certificates-certs
        name: certs
      - configMap:
          name: step-certificates-config
        name: config
      - name: secrets
        secret:
          secretName: step-certificates-secrets
      - name: ca-password
        secret:
          secretName: step-certificates-ca-password
  ttlSecondsAfterFinished: 60

It works as intended -- it will fail a couple of times while the main application is starting up, but then it runs, and everything looks great:

$ kubectl get pods
NAME                            READY   STATUS      RESTARTS   AGE
create-acme-provisioner-7zhp2   0/1     Completed   0          12s
step-certificates-0             2/2     Running     0          54m
$ kubectl get jobs
NAME                      COMPLETIONS   DURATION   AGE
create-acme-provisioner   1/1           3s         20s

The problem is that ArgoCD keeps re-syncing the Job resource.every minute, so the job runs again...and again...and so forth. The logs from the argocd-application-controller pod look like this:

time="2022-09-30T16:20:42Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:114442fcfb789190cfb9e7353a636369e7113c01,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:batch,Kind:Job,Name:create-acme-provisioner,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[CreateNamespace=true],} { true} [] {-1 &Backoff{Duration:30s,Factor:*2,MaxDuration:10m,}}}" application=step-certificates-infra
time="2022-09-30T16:20:42Z" level=info msg="Tasks (dry-run)" application=step-certificates-infra syncId=00259-Dpgma tasks="[Sync/0 resource batch/Job:step-certificates/create-acme-provisioner nil->obj (,,)]"
time="2022-09-30T16:20:42Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:20:42Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:20:42Z" level=info msg="Adding resource result, status: 'Synced', phase: 'Running', message: 'job.batch/create-acme-provisioner created'" application=step-certificates-infra kind=Job name=create-acme-provisioner namespace=step-certificates phase=Sync syncId=00259-Dpgma
time="2022-09-30T16:21:45Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:114442fcfb789190cfb9e7353a636369e7113c01,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:batch,Kind:Job,Name:create-acme-provisioner,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[CreateNamespace=true],} { true} [] {-1 &Backoff{Duration:30s,Factor:*2,MaxDuration:10m,}}}" application=step-certificates-infra
time="2022-09-30T16:21:45Z" level=info msg="Tasks (dry-run)" application=step-certificates-infra syncId=00260-KsLXq tasks="[Sync/0 resource batch/Job:step-certificates/create-acme-provisioner nil->obj (,,)]"
time="2022-09-30T16:21:45Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:21:45Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:21:45Z" level=info msg="Adding resource result, status: 'Synced', phase: 'Running', message: 'job.batch/create-acme-provisioner created'" application=step-certificates-infra kind=Job name=create-acme-provisioner namespace=step-certificates phase=Sync syncId=00260-KsLXq
time="2022-09-30T16:22:49Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:114442fcfb789190cfb9e7353a636369e7113c01,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:batch,Kind:Job,Name:create-acme-provisioner,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[CreateNamespace=true],} { true} [] {-1 &Backoff{Duration:30s,Factor:*2,MaxDuration:10m,}}}" application=step-certificates-infra
time="2022-09-30T16:22:49Z" level=info msg="Tasks (dry-run)" application=step-certificates-infra syncId=00261-itFqU tasks="[Sync/0 resource batch/Job:step-certificates/create-acme-provisioner nil->obj (,,)]"
time="2022-09-30T16:22:49Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:22:49Z" level=info msg="Applying resource Job/create-acme-provisioner in cluster: https://10.96.0.1:443, namespace: step-certificates"
time="2022-09-30T16:22:49Z" level=info msg="Adding resource result, status: 'Synced', phase: 'Running', message: 'job.batch/create-acme-provisioner created'" application=step-certificates-infra kind=Job name=create-acme-provisioner namespace=step-certificates phase=Sync syncId=00261-itFqU

Why is ArgoCD re-syncing this resource, and how do I get it to stop?

larsks
  • 43,623
  • 14
  • 121
  • 180
  • I would look into resource hooks: https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/ . Apply that job during PostSync, with condition HookSucceeded. Although it may still re-create that job, at least rolling out another change/to test .... answering how to make it stop: figure out what diff argocd sees, that would trigger re-creation: get your job object after completion, check for diffs (outside of ownerRefs/metadata/status). – SYN Oct 01 '22 at 09:22

1 Answers1

3

I figured out what was going on.

The Job was configured with ttlSecondsAfterFinished, which is documented here. I had misread the documentation and thought this would clean up the Pods created by the job, but in fact it causes the Job itself to be removed.

Because the Job was managed by ArgoCD, when it was deleted due to the ttlSecondsAfterFinished setting ArgoCD would prompt re-create it.

As @SYN suggested in a comment, an alternative solution is to configure the Job as an ArgoCD PostSync hook with a hook-delete-policy:

apiVersion: batch/v1
kind: Job
metadata:
  name: create-acme-provisioner
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:

When ArgoCD successfully syncs the application, it will create this job, and when the job is successful, ArgoCD will delete it.

This means the job runs once on every sync, but that's fine. It's no longer running every 60 seconds.

larsks
  • 43,623
  • 14
  • 121
  • 180