
I use custom Docker containers to run Dataflow jobs. I want to chain them together with my TPU training job and other steps, so I'm considering running a Kubeflow pipeline on Vertex AI. Is this a sensible idea? (There seem to be many alternatives, like Airflow.)

In particular, must I use DataflowPythonJobOp in the pipeline? It does not seem to support custom worker images. I assume I can just have one small machine that launches the Dataflow pipeline and stays idle (besides writing some logs) until the Dataflow pipeline finishes?

– bill

1 Answer


Have you tried to pass the custom container args with https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/dataflow.html#v1.dataflow.DataflowPythonJobOp.args?
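For example, something along these lines should get the Dataflow workers onto your custom image, since everything in `args` is forwarded to the Beam pipeline as ordinary pipeline options. This is only a rough sketch: the project, bucket, module path, and image below are placeholders.

```python
# A rough sketch, not tested end-to-end: the bucket, project, module path and
# image names below are placeholders you would replace with your own.
from kfp import compiler, dsl
from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp


@dsl.pipeline(name="dataflow-with-custom-worker-image")
def pipeline(
    project: str = "my-project",                # placeholder
    region: str = "us-central1",
    temp_location: str = "gs://my-bucket/tmp",  # placeholder
):
    DataflowPythonJobOp(
        project=project,
        location=region,
        # The self-executing Beam file that the launcher submits to Dataflow.
        python_module_path="gs://my-bucket/src/my_beam_job.py",  # placeholder
        temp_location=temp_location,
        # Everything in `args` is passed through to the Beam pipeline as
        # ordinary pipeline options, so the custom worker image goes here.
        args=[
            "--sdk_container_image=us-central1-docker.pkg.dev/my-project/repo/beam-worker:latest",
            "--experiments=use_runner_v2",
        ],
    )


if __name__ == "__main__":
    compiler.Compiler().compile(pipeline, package_path="pipeline.json")
```

The component only submits the job; if you want downstream steps (e.g. the TPU training) to wait for the Dataflow job to finish, the usual pattern is to feed its `gcp_resources` output into `WaitGcpResourcesOp`. So yes, the launcher itself can stay a small, mostly idle machine.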

– XQ Hu
  • So the custom container is at best only used by the workers? If my `python_module_path` refers to custom libraries or pip packages not found on the GCPC (Google Cloud Pipeline Components) image, launching the Dataflow job (which happens on GCPC) will fail in the first place? – bill Jul 19 '23 at 20:19
  • You could move the imports inside the Beam pipeline code to avoid the launching error (see the sketch after these comments). – XQ Hu Jul 20 '23 at 21:15
  • @bill - Have you managed to sort it out? I also have the same requirement. If I use https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/dataflow.html#v1.dataflow.DataflowPythonJobOp, the Vertex AI pipeline fails as it tries to pull Apache Beam from pypi.org, and we don't have access to public repos. Even if I create a container image for the Dataflow pipeline and use it in the pipeline options, it still gets stuck on 'from apache_beam import PipelineOptions', as Apache Beam is not on the GCPC image. – ForeverStudent Aug 25 '23 at 01:57
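For anyone hitting the import problem discussed above, a minimal sketch of the deferred-import idea could look roughly like this: the top level of the module only uses what the launcher image already has, and the custom dependency (here the made-up `my_custom_lib`, plus placeholder GCS paths) is imported inside the `DoFn`, which runs on the custom worker image.

```python
import apache_beam as beam


class EnrichFn(beam.DoFn):
    """Defers the custom import to the workers, which run the custom image."""

    def setup(self):
        # Runs once per worker, inside the custom container where the
        # dependency is actually installed; `my_custom_lib` is a placeholder.
        import my_custom_lib
        self._model = my_custom_lib.load_model()

    def process(self, element):
        yield self._model.predict(element)


def run(argv=None):
    with beam.Pipeline(argv=argv) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")  # placeholder
            | "Enrich" >> beam.ParDo(EnrichFn())
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output")     # placeholder
        )


if __name__ == "__main__":
    run()
```

Note that `apache_beam` itself still has to be importable on the launcher, which is the separate (air-gapped) problem described in the last comment; only the custom packages can be deferred this way.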