Vertex AI 504 Errors in batch job - How to fix/troubleshoot

Question

We have a Vertex AI model that takes a relatively long time to return a prediction.

When hitting the model endpoint with one instance, things work fine. But batch jobs of size say 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K but I'm troubleshooting with 1000).

I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.

I then read that the default batch size is 64 and so tried decreasing the batch size to 4 like this from the python code that creates the batch job:

model_parameters = dict(batch_size=4)

def run_batch_prediction_job(vertex_config):

    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )

    model = aiplatform.Model(vertex_config.model_resource_name)

    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params
    )

    batch_prediction_job = model.batch_predict(**batch_params)

    batch_prediction_job.wait()

    return batch_prediction_job

I've also tried increasing the machine type to n1-high-cpu-16 and that helped somewhat but I'm not sure I understand how batches are sent to replicas?

Is there another way to decrease the number of instances sent to the model? Or is there a way to increase the timeout? Is there log output I can use to help figure this out? Thanks

Can you try using the aiplatform_v1 instead? Its `JobServiceClient`, has the method [create_batch_prediction_job()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.job_service.JobServiceClient?hl=it#google_cloud_aiplatform_v1_services_job_service_JobServiceClient_create_batch_prediction_job) which accepts parameters `retry` and `timeout`. Can you try using this method and adjust your `timeout` or `retry` just to check if the 504 errors decrease? — Ricco D, Feb 22 '22 at 07:04
Thanks Ricco I'm going to try that. I don't see anything in the docs describing that timeout in more detail. Is that timeout for a single instance request or a batch request. Also, is it in seconds? I'm concerned that if I make it too big, I'll incur huge billing. For the create_batch_prediction_job() I'm looking at https://github.com/googleapis/python-aiplatform/blob/main/samples/generated_samples/aiplatform_generated_aiplatform_v1_job_service_create_batch_prediction_job_sync.py — Melissa Stockman, Feb 23 '22 at 18:11

score 0 · Answer 1 · answered Feb 24 '22 at 05:42

Answering your follow up question above.

Is that timeout for a single instance request or a batch request. Also, is it in seconds?

This is a timeout for the batch job creation request.

The timeout is in seconds, according to create_batch_prediction_job() timeout refers to rpc timeout. If we trace the code we will end up here and eventually to gapic where timeout is properly described.

timeout (float): The amount of time in seconds to wait for the RPC
             to complete. Note that if ``retry`` is used, this timeout
             applies to each individual attempt and the overall time it
             takes for this method to complete may be longer. If
             unspecified, the the default timeout in the client
             configuration is used. If ``None``, then the RPC method will
             not time out.

What I could suggest is to stick with whatever is working for your prediction model. If ever adding the timeout will improve your model might as well build on it along with your initial solution where you used a machine with a higher spec. You can also try using a machine with higher memory like the n1-highmem-* family.

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

1 Answers1