On a SageMaker project, I have a pipeline containing several steps. In particular, the batch transform step has an internal problem that occurs infrequently, so I have been unable to reproduce the error.
This step goes into "AlgorithmError" after 16 long hours (I have not set any timeout), even though the problem occurs immediately.
The current workaround is that after the step has failed, I can manually retry the pipeline (e.g. via the "Retry" button in SageMaker Studio) and restart this step, with a good chance that it will not fail again. So far it has always worked, but I would like to automate this retry. Clearly, this is a way around the problem while waiting for a proper fix.
I have seen that there is a "Retry Policy for Pipeline Steps", and that it also applies to batch transform jobs, but I have not quite understood which of the "Supported exception types" is suitable, or how to configure it for "AlgorithmError".
To put it in context, this is my pipeline step code (so the retry configuration, number of retries, etc. are all left at their defaults):
from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep
from sagemaker.inputs import TransformInput

transformer = Transformer(
    model_name=my_model.properties.ModelName,
    instance_count=test_processor_instance_count,
    instance_type=test_processor_instance_type,
    output_path=my_output_test_path,
)

step_transform = TransformStep(
    name="TestInference",
    description="Test Inference",
    transformer=transformer,
    inputs=TransformInput(
        data=my_data_s3_uri,
        content_type="application/octet-stream",
        data_type="S3Prefix",
    ),
)
Is it possible to avoid waiting 16 hours before the step fails? The whole step normally takes about 1 hour.
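Since CreateTransformJob does not seem to expose an overall job timeout, the closest thing I found in the API docs is the ModelClientConfig, which caps how long each model invocation may take and how often it is retried. My untested guess is that something like the following would make a hung invocation fail the job quickly instead of stalling for hours (the exact values here are placeholders, and I am not certain this setting covers my failure mode):

```python
from sagemaker.inputs import TransformInput

# Assumption: InvocationsTimeoutInSeconds / InvocationsMaxRetries are the
# ModelClientConfig fields from the CreateTransformJob API; whether they
# apply to this failure mode is my guess, not something I have verified.
inputs = TransformInput(
    data=my_data_s3_uri,
    content_type="application/octet-stream",
    data_type="S3Prefix",
    model_client_config={
        "InvocationsTimeoutInSeconds": 600,  # fail an invocation after 10 min
        "InvocationsMaxRetries": 1,          # retry a failed invocation once
    },
)
```

Would this be the right mechanism, or is there a proper job-level timeout for transform steps that I have missed?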
How do I configure an automatic retry in case of an "AlgorithmError"?
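Based on the retry policy docs, my guess is that an AlgorithmError raised by the transform job would fall under the SageMaker.JOB_INTERNAL_ERROR exception type, so the configuration would look something like this (untested; I am not sure this is the right exception type, and the retry intervals are arbitrary):

```python
from sagemaker.workflow.retry import (
    SageMakerJobExceptionTypeEnum,
    SageMakerJobStepRetryPolicy,
)

# Assumption: AlgorithmError is surfaced as SageMaker.JOB_INTERNAL_ERROR.
retry_policy = SageMakerJobStepRetryPolicy(
    exception_types=[SageMakerJobExceptionTypeEnum.INTERNAL_ERROR],
    interval_seconds=60,  # wait before the first retry
    backoff_rate=2.0,     # double the wait on each subsequent retry
    max_attempts=2,       # retry at most twice
)

step_transform = TransformStep(
    name="TestInference",
    description="Test Inference",
    transformer=transformer,
    inputs=TransformInput(
        data=my_data_s3_uri,
        content_type="application/octet-stream",
        data_type="S3Prefix",
    ),
    retry_policies=[retry_policy],
)
```

Is INTERNAL_ERROR actually the exception type that matches "AlgorithmError", or should I be using one of the other supported types?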