On a SageMaker project, I have a pipeline containing several steps. In particular, the batch transform step has an internal problem that occurs infrequently, so I have been unable to reproduce the error.
This step goes into "AlgorithmError" after 16 long hours (I have not set any timeout), even though the problem occurs immediately.
The current workaround is that after the step has failed, I can manually retry the pipeline (e.g. via the "Retry" button in SageMaker Studio) and restart this step, with a good chance that it will not fail again. So far it has always worked, but I would like to automate this retry. Clearly, this is a way around the problem while waiting for a proper fix.
I have seen that there is a "Retry Policy for Pipeline Steps", and that it also applies to batch transform jobs, but I have not quite understood which of the "Supported exception types" is suitable, or how to configure it for "AlgorithmError".
To put it in context, this is my pipeline step code (so the retry configuration, number of retries, etc. are all left at their defaults):
from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep
from sagemaker.inputs import TransformInput

transformer = Transformer(
    model_name=my_model.properties.ModelName,
    instance_count=test_processor_instance_count,
    instance_type=test_processor_instance_type,
    output_path=my_output_test_path,
)

step_transform = TransformStep(
    name="TestInference",
    description="Test Inference",
    transformer=transformer,
    inputs=TransformInput(
        data=my_data_s3_uri,
        content_type="application/octet-stream",
        data_type="S3Prefix",
    ),
)
Is it possible to avoid waiting 16 hours before the step fails? The whole step normally takes about 1 hour.
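Since CreateTransformJob does not seem to expose an overall job timeout, the closest thing I found in the API docs is the ModelClientConfig, which caps how long each model invocation may take and how often it is retried. My untested guess is that something like the following would make a hung invocation fail the job quickly instead of stalling for hours (the exact values here are placeholders, and I am not certain this setting covers my failure mode):

```python
from sagemaker.inputs import TransformInput

# Assumption: InvocationsTimeoutInSeconds / InvocationsMaxRetries are the
# ModelClientConfig fields from the CreateTransformJob API; whether they
# apply to this failure mode is my guess, not something I have verified.
inputs = TransformInput(
    data=my_data_s3_uri,
    content_type="application/octet-stream",
    data_type="S3Prefix",
    model_client_config={
        "InvocationsTimeoutInSeconds": 600,  # fail an invocation after 10 min
        "InvocationsMaxRetries": 1,          # retry a failed invocation once
    },
)
```

Would this be the right mechanism, or is there a proper job-level timeout for transform steps that I have missed?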
How do I configure an automatic retry in case of an "AlgorithmError"?
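Based on the retry policy docs, my guess is that an AlgorithmError raised by the transform job would fall under the SageMaker.JOB_INTERNAL_ERROR exception type, so the configuration would look something like this (untested; I am not sure this is the right exception type, and the retry intervals are arbitrary):

```python
from sagemaker.workflow.retry import (
    SageMakerJobExceptionTypeEnum,
    SageMakerJobStepRetryPolicy,
)

# Assumption: AlgorithmError is surfaced as SageMaker.JOB_INTERNAL_ERROR.
retry_policy = SageMakerJobStepRetryPolicy(
    exception_types=[SageMakerJobExceptionTypeEnum.INTERNAL_ERROR],
    interval_seconds=60,  # wait before the first retry
    backoff_rate=2.0,     # double the wait on each subsequent retry
    max_attempts=2,       # retry at most twice
)

step_transform = TransformStep(
    name="TestInference",
    description="Test Inference",
    transformer=transformer,
    inputs=TransformInput(
        data=my_data_s3_uri,
        content_type="application/octet-stream",
        data_type="S3Prefix",
    ),
    retry_policies=[retry_policy],
)
```

Is INTERNAL_ERROR actually the exception type that matches "AlgorithmError", or should I be using one of the other supported types?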