
I have a Java Spring Boot app that was previously building well, and we are now having issues.

We are using GCP, with Cloud Build triggers that start a build automatically when we push to certain branches. The goal is for the app to build itself and then deploy to App Engine. After much trial and error, earlier iterations of this setup did this successfully.

The app builds and deploys successfully: if I push code, it builds and the new version works. But Cloud Build keeps reporting that the build failed.

Our cloudbuild.yaml

steps:
- id: 'Stage app using mvn appengine plugin on mvn cloud build image'   
  name: 'gcr.io/cloud-builders/mvn'
  args: ['package', 'appengine:stage', '-Dapp.stage.appEngineDirectory=src/main/appengine/$_GAE_YAML', '-P cloud-gcp']
  timeout: 1600s
- id: "Deploy to app engine using gcloud image"
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['app', 'deploy', 'target/appengine-staging/app.yaml',
         '-q', '$_GAE_PROMOTE', '-v', '$_GAE_VERSION']
  timeout: 1600s
- id: "Splitting Traffic"
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['app', 'services', 'set-traffic', '--splits', '$_GAE_TRAFFIC']
timeout: 3200s

For reference here is an app.yaml

runtime: java
env: flex
runtime_config:
  jdk: openjdk8
env_variables:
  SPRING_PROFILES_ACTIVE: "dev"
handlers:
  - url: /.*
    script: this field is required, but ignored
    secure: always
manual_scaling:
  instances: 1
resources:
  cpu: 2
  memory_gb: 2
  disk_size_gb: 10
  volumes:
    - name: ramdisk1
      volume_type: tmpfs
      size_gb: 0.5

The first step completes just fine, or seemingly so.

The app becomes available on that specific version and runs just fine.

Here is the current "failure" we are facing, found in the output of the failed builds in the second step:

--------------------------------------------------------------------------------
Updating service [default] (this may take several minutes)...

ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-11-04T14:55:50.087Z257173.in.0:
There was an error while pulling the application's docker image: the image does
not exist, one of the image layers is missing or the default service account
does not have  permission to pull the image. Please check if the image exists.
Also check if the default service account has the role Storage Object Viewer
(roles/storage.objectViewer) to pull images from Google Container
Registry or Artifact Registry Reader (roles/artifactregistry.reader) to pull
images from Artifact Registry. Refer to https://cloud.google.com/container-registry/docs/access-control
in granting access to pull images from GCR. Refer to https://cloud.google.com/artifact-registry/docs/access-control#roles
in granting access to pull images from Artifact Registry.

We have been having pretty consistent issues with the caching of builds, to the point where in the past we pushed new code and it launched old versions of the code. I think it may all be related.

We tried clearing the entire Container Registry cache for the specific version of the app, and that is when this specific issue started occurring. I have a feeling it is building and launching one version of the app, then going back and trying to launch a different version right on top of it. I am looking for a way to at least get more verbose logging, but this is mostly where I am stuck.

How do I go about adjusting the "name: 'gcr.io/cloud-builders/gcloud'" step to properly indicate that a deployment worked? Is that the right approach?

Mark Amber
    I think you deleted some portion of the image cache that you are not supposed to do. You seem to have deleted the image layer files from GCR directly, which resulted in not properly cleaning the App Engine cache. Most likely the cache metadata still believes that image layers are there in GCR and points to the missing files you deleted. That said, how about trying [`gcloud app deploy --no-cache`](https://cloud.google.com/sdk/gcloud/reference/app/deploy)? – Chanseok Oh Nov 08 '21 at 17:38
  • @chanseok-oh Oh I do at some point want to know exactly what happened here. As you can see in the answer I posted I did resolve this by changing the port of the java app. Not exactly sure how/why that worked so well. Any ideas? – Mark Amber Nov 16 '21 at 00:11
  • 1
    I don't know for sure. My wild speculation is that, since you updated a file (`application.properties`), it results in creating a different container image layer, so the cache corruption on the previous image doesn't apply anymore. – Chanseok Oh Nov 23 '21 at 17:14

2 Answers

0

The error response code 9 (application startup error) is a fairly general error indicating that the deployed application failed to start for some reason, and so is not running properly (or the VM believes so). From what you describe, the app appears to be deployed to the VM, but the VM goes down after a while because the app fails to start.

For more information on why it is crashing, check the server logs in the Cloud Console.

Update the gcloud components with the `gcloud components update` command, then try deploying your app again.

Make sure the SDK is running as an administrator.

If the error persists, try running `gcloud app deploy app.yaml --verbosity=debug` to see if you can get a more specific error.
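Since the deploy in the question runs inside Cloud Build rather than from a local shell, the same flag can be passed in the deploy step of the cloudbuild.yaml. The following is a minimal sketch that reuses the question's step and substitutions unchanged and only appends two flags: `--verbosity=debug` is a global gcloud flag, and `--no-cache` is the `gcloud app deploy` flag suggested in the comments on the question. Whether either one changes the outcome here is untested.

```yaml
steps:
  # Sketch of the question's deploy step with extra diagnostics appended.
  - id: "Deploy to app engine using gcloud image"
    name: 'gcr.io/cloud-builders/gcloud'
    args:
      - 'app'
      - 'deploy'
      - 'target/appengine-staging/app.yaml'
      - '-q'
      - '$_GAE_PROMOTE'
      - '-v'
      - '$_GAE_VERSION'
      - '--no-cache'         # skip the deployment caches (from the comment thread)
      - '--verbosity=debug'  # debug-level gcloud output in the build log
    timeout: 1600s
```

The debug output appears in the Cloud Build log for that step, next to the error shown in the question.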

According to the error message, there also seems to be an issue with the Docker image: it suggests checking whether the image exists. The Docker documentation may help with that. The error also mentions that the service account may not have permission to pull the image, so check the documentation on the required permissions.

The Container Registry access control documentation covers permissions and roles, granting IAM permissions, and configuring public access to images.

The end of the error message also points to the Artifact Registry documentation. Artifact Registry is Google Cloud's recommended solution for container image storage and management.

Artifact Registry expands the capabilities of Container Registry by providing a fully managed service that supports both container images and non-container artifacts.

0

Answering my own question here.

It turns out that the application was deploying but listening on the wrong port. We just added server.port=8080 to the application.properties file and things started working again.
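The fix above lives in `application.properties`. For projects that keep their Spring Boot settings in `application.yml` instead (an assumption; this app does not), the equivalent setting would look roughly like this. The App Engine flexible environment routes incoming requests to port 8080 in the container, which is why the port matters.

```yaml
# application.yml (hypothetical file; the fix in this answer used application.properties)
server:
  port: 8080  # same effect as server.port=8080
```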

I do believe what Chanseok Oh mentioned in the comment on my question was also true, although changing the port seemed to be the one and only thing that solved this.

GCP was running a readiness check and getting nothing back. It is unclear whether this was related to the cache of the artifacts at all.
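For reference, the readiness check can be tuned in app.yaml through the `readiness_check` section. The sketch below is an assumption, not something taken from our setup: the `/actuator/health` path only exists if the app includes Spring Boot Actuator, and the timing values are illustrative rather than recommended.

```yaml
# Addition to app.yaml (env: flex); all values are illustrative assumptions.
readiness_check:
  path: "/actuator/health"     # hypothetical endpoint; requires spring-boot-starter-actuator
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 600   # allow more time for a slow JVM startup
```

If the app listens on the wrong port, this check would still fail, but a longer `app_start_timeout_sec` can help separate a slow start from an app that never answers.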

Mark Amber
    It may not be that the port was wrong. (Or maybe it was.) A wild speculation of mine is that you updated a file (`application.properties`) and it results in creating a different container image layer. So the cache corruption on the previous image doesn't apply anymore. But I may be wrong about this. – Chanseok Oh Nov 23 '21 at 17:17
  • @ChanseokOh Having this issue again. I feel like I keep breaking through further. I downloaded the image off us.artifacts.[app-id].appspot.com and that shows it is getting the new files. The docker image in the container registry contains the correct app.jar too. But after `Updating service [default] (this may take several minutes)...` section in gcloud deploy it spits out CLEARLY not the right version. Happy to report this somewhere official... I will look and try to see how/where. – Mark Amber Mar 18 '22 at 00:24
  • @ChanseokOh ACTUALLY figured it out. If starting the image with `gcloud deploy` succeeds (aka, does not exit 1) but the deployment itself is locked up, then the next time you `gcloud deploy` that version without manually stopping all instances will fail... Presumably the new image is just not pulled, and has something to do with the application not working. My workaround is just to not use versions the way I am using versions but create more projects I guess... – Mark Amber Mar 19 '22 at 18:32
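One way to automate the manual stop described in the last comment would be an extra Cloud Build step before the deploy step. This is a sketch under assumptions: it reuses the `$_GAE_VERSION` substitution from the question, targets the `default` service named in the build log, and has not been verified against this specific failure. Stopping a version that is serving traffic causes downtime, and the step fails if the version does not exist yet.

```yaml
steps:
  # Hypothetical extra step, placed before the "Deploy to app engine" step.
  - id: "Stop existing instances of the target version"
    name: 'gcr.io/cloud-builders/gcloud'
    args: ['app', 'versions', 'stop', '$_GAE_VERSION',
           '--service', 'default', '-q']
    timeout: 600s
```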