
My Kubeflow pipeline components/jobs continue to run indefinitely even though the main execution has finished. From these logs, can anyone see why the job won't finish successfully?

It seems that there is a wait container that continues to run, even though the main container has successfully completed.

Any insight is much appreciated.

  
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/secondary-market-pipeline-6plbl-940127540 to gke-cluster-1-pool-1-46a6353b-wfpg
  Normal  Pulled     10m   kubelet            Container image "gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/argoexecutor:1.7.1" already present on machine
  Normal  Created    10m   kubelet            Created container wait
  Normal  Started    10m   kubelet            Started container wait
  Normal  Pulling    10m   kubelet            Pulling image "<image>:latest"
  Normal  Pulled     10m   kubelet            Successfully pulled image "<image>:latest" in 1.617667035s
  Normal  Created    10m   kubelet            Created container main
  Normal  Started    10m   kubelet            Started container main
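
(For reference, the container states can be confirmed programmatically. Below is only a minimal sketch using the official Kubernetes Python client, with the pod name taken from the events above; it assumes kubectl access to the cluster.)

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (assumes kubectl access to the cluster).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Pod name taken from the events above; adjust the namespace if yours differs.
    pod = v1.read_namespaced_pod(
        name='secondary-market-pipeline-6plbl-940127540',
        namespace='default',
    )

    # Print each container's state: 'main' shows up as terminated
    # while 'wait' is still running.
    for status in pod.status.container_statuses:
        state = status.state
        if state.terminated:
            print(f"{status.name}: terminated (exit code {state.terminated.exit_code})")
        elif state.running:
            print(f"{status.name}: running since {state.running.started_at}")
        else:
            print(f"{status.name}: waiting ({state.waiting.reason})")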

  • I've narrowed this down to the use of high-mem nodes, but I'm not sure why those aren't able to wrap up successfully. – ashemag Nov 15 '21 at 16:34

1 Answer


My solution was found in this thread: https://github.com/kubeflow/pipelines/issues/6793

My high-mem nodes were not using the "Container-Optimized OS with Docker" image type, and they needed to be. Creating a new node pool with that image type fixed the issue.
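
If it helps anyone else, here is a minimal KFP v1 sketch of pinning a step to the recreated node pool via a node selector. The pool name 'highmem-pool-cos' is an assumption and the image reference is the placeholder from the question, not values from the original pipeline:

    import kfp
    from kfp import dsl

    @dsl.pipeline(name='secondary-market-pipeline')
    def pipeline():
        # The heavy step; the image reference is the placeholder from the question.
        train_op = dsl.ContainerOp(
            name='train',
            image='<image>:latest',
        )
        # Pin the step to the recreated node pool (the pool name is an assumption)
        # so it only lands on nodes using the Container-Optimized OS with Docker
        # image type.
        train_op.add_node_selector_constraint('cloud.google.com/gke-nodepool',
                                              'highmem-pool-cos')

    if __name__ == '__main__':
        kfp.compiler.Compiler().compile(pipeline, 'pipeline.yaml')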

ashemag