Vertex AI - RuntimeError: Job failed with: code: 13 message: "Internal error encountered. Please try again"

Question

I am trying to run a Vertex AI Pipeline.

The pipeline is created successfully: PipelineJob created. Resource name: XXX

Then I get PipelineState.PIPELINE_STATE_PENDING multiple times until it crashes with this error:

Traceback (most recent call last):
  File "/src/pipelines/build_model/pipeline_run.py", line 288, in <module>
    cli()
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/src/pipelines/build_model/pipeline_run.py", line 284, in cli
    job.run()
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/google/cloud/aiplatform/pipeline_jobs.py", line 314, in run
    self._run(
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/google/cloud/aiplatform/base.py", line 810, in wrapper
    return method(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/google/cloud/aiplatform/pipeline_jobs.py", line 351, in _run
    self._block_until_complete()
  File "/opt/pysetup/.venv/lib/python3.9/site-packages/google/cloud/aiplatform/pipeline_jobs.py", line 499, in _block_until_complete
    raise RuntimeError("Job failed with:\n%s" % self._gca_resource.error)
RuntimeError: Job failed with:
code: 13
message: "Internal error encountered. Please try again"

This pipeline currently works in a dev GCP project, where it automatically goes into a RUNNING state.

I have this issue when I try to make it work in another GCP project. I have reproduced the same steps (APIs enabled, service account created, same rights, same location); in my code I only change the project_id and credentials.

I have tried changing the location to check that it is not due to a lack of resources on Google's side. I also tried a really simple Hello World pipeline, and it never gets into the RUNNING state either.

I have also checked Cloud Logging but can't find anything useful.

Any ideas? Thanks

Internal errors are mainly caused by system errors and are mostly transient. But since they are not very descriptive, I would advise opening a [support ticket](https://www.google.com/aclk?sa=l&ai=DChcSEwiy2Yjz1uz-AhVjmmYCHcOWC9EYABABGgJzbQ&sig=AOD64_2jnoaj-Kt3pj5MUKzCPSajdYF0DA&adurl&ved=2ahUKEwj1xYLz1uz-AhVV-DgGHRlkDDgQqyQoAHoECAgQCw) with GCP or creating an issue in the GCP [public issue tracker](https://cloud.google.com/support/docs/issue-trackers) to get a precise description of the issue and a solution. – Sakshi Gatyan, May 11 '23 at 06:53
Don't you find it weird that the pipeline doesn't even start? How can it be a system error if no node is executed? – L.GAYET, May 11 '23 at 07:49

score 1 · Accepted Answer · answered May 25 '23 at 13:17

I finally found out what was missing: some IAM permissions (for Cloud Storage and BigQuery in my case).
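For anyone wanting to check this themselves, here is a minimal sketch of how the Cloud Storage side could be tested. It assumes the google-cloud-storage client library is installed and that the code runs with the credentials of the service account the pipeline uses. The permission list and the helper names (`ASSUMED_BUCKET_PERMISSIONS`, `missing_permissions`, `check_bucket_permissions`) are illustrative assumptions; the answer does not say which permissions were actually missing.

```python
from typing import Iterable, List

# Illustrative list only: the answer does not name the missing permissions.
ASSUMED_BUCKET_PERMISSIONS = [
    "storage.objects.create",
    "storage.objects.get",
    "storage.objects.list",
]


def missing_permissions(required: Iterable[str], granted: Iterable[str]) -> List[str]:
    """Return the required permissions that are absent from `granted`, sorted."""
    return sorted(set(required) - set(granted))


def check_bucket_permissions(
    bucket_name: str,
    required: Iterable[str] = ASSUMED_BUCKET_PERMISSIONS,
) -> List[str]:
    """Ask Cloud Storage which of `required` the current credentials lack on the bucket."""
    # Imported lazily so the pure helper above works without the client library.
    from google.cloud import storage

    required = list(required)
    bucket = storage.Client().bucket(bucket_name)
    # test_iam_permissions returns the subset of `required` the caller holds.
    granted = bucket.test_iam_permissions(required)
    return missing_permissions(required, granted)
```

Calling `check_bucket_permissions("my-pipeline-root-bucket")` (a hypothetical bucket name) returns the permissions the current credentials do not hold on that bucket; an empty list means none of the listed ones are missing. BigQuery access would need a separate check.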

L.GAYET

Can you elaborate on how you figured out which permissions you were missing? – Roy van Santen Aug 03 '23 at 09:22

score 0 · Answer 2 · answered Aug 08 '23 at 14:13

I got this error using a GCS bucket in a different region than the region my pipeline ran in.
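A quick way to rule this out is to compare the bucket's location with the pipeline's location. The sketch below assumes the google-cloud-storage client library; the helper names `same_location` and `bucket_matches_pipeline` are made up for illustration. The comparison is deliberately strict, so a multi-region bucket such as "US" is reported as a mismatch against "us-central1"; whether that combination really fails is not something I have verified.

```python
def same_location(bucket_location: str, pipeline_location: str) -> bool:
    """True only when both names denote the same location, ignoring case and whitespace."""
    return bucket_location.strip().lower() == pipeline_location.strip().lower()


def bucket_matches_pipeline(bucket_name: str, pipeline_location: str) -> bool:
    """Fetch the bucket's metadata and compare its location with the pipeline's."""
    # Imported lazily so same_location can be used without the client library.
    from google.cloud import storage

    # get_bucket makes an API call; Bucket.location is typically upper case,
    # e.g. "EUROPE-WEST1", which is why same_location ignores case.
    bucket = storage.Client().get_bucket(bucket_name)
    return same_location(bucket.location, pipeline_location)
```

For example, `bucket_matches_pipeline("my-pipeline-root-bucket", "europe-west1")` (hypothetical names) returns False when the bucket lives anywhere other than europe-west1.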

Roy van Santen
