
SageMaker Pipelines are rather unclear to me. I'm not experienced in the field of ML, but I'm working on figuring out the pipeline definitions.

I have a few questions:

  • Is SageMaker Pipelines a stand-alone service/feature? I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.

  • Is a SageMaker pipeline essentially CodePipeline? How do they integrate, and how do they differ?

  • There's also a Python SDK; how does this differ from the CDK and CloudFormation?

I can't seem to find any examples besides Python SDK usage. How come?

The docs and workshops only seem to properly describe Python SDK usage. It would be really helpful if someone could clear this up for me!

1 Answer

SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former.

A model building pipeline defines the steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints.

A serial inference pipeline is two or more SageMaker models run one after the other.
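
For contrast, here's a minimal, hypothetical sketch of a serial inference pipeline using the SDK's PipelineModel (`model_a` and `model_b` are placeholders for real Model objects, and the name is made up):

from sagemaker.pipeline import PipelineModel

# Chain two existing SageMaker Model objects into one serial pipeline
pipeline_model = PipelineModel(
    name="my-inference-pipeline",
    models=[model_a, model_b],  # invoked in order on each request
    role="role-arn",
)

# Deploys a single endpoint that runs both containers serially
pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")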

A model building pipeline is defined in JSON, and is hosted/run in some sort of proprietary, serverless fashion by SageMaker.

Is SageMaker Pipelines a stand-alone service/feature? I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.

You can create/modify them using the API, which can also be called via the CLI, the Python SDK, or CloudFormation. These all use the AWS API under the hood.
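
For example, a minimal sketch of calling the API directly through boto3 (the name and role ARN are placeholders, and `pipeline_definition_json` is assumed to hold a JSON definition string like the one shown further down):

import boto3

# Create a pipeline from a JSON definition string
sm = boto3.client("sagemaker")
sm.create_pipeline(
    PipelineName="foo",
    PipelineDefinition=pipeline_definition_json,
    RoleArn="arn:aws:iam::123456789012:role/foo",
)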

You can start/stop/view them in SageMaker Studio:

Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines

Is a SageMaker pipeline essentially CodePipeline? How do they integrate, and how do they differ?

Not really. CodePipeline is for building and deploying code in general; it isn't specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SageMaker pipeline from CodePipeline.
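
If you did want to trigger one from the other, a hypothetical sketch is a Lambda-backed CodePipeline action that starts the SageMaker pipeline (the pipeline name here is a placeholder):

import boto3

def handler(event, context):
    # Kick off an execution of the SageMaker pipeline
    boto3.client("sagemaker").start_pipeline_execution(PipelineName="foo")

    # Report success back to CodePipeline so the stage can continue
    job_id = event["CodePipeline.job"]["id"]
    boto3.client("codepipeline").put_job_success_result(jobId=job_id)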

There's also a Python SDK; how does this differ from the CDK and CloudFormation?

The Python SDK is a stand-alone library for interacting with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string.

A very simple example of Python SageMaker SDK usage:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# Processor that runs a scikit-learn container for the pre-processing job
processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_count=1,
    instance_type="ml.m5.large",
    role="role-arn",
)

# Pipeline step that runs preprocessor.py on the processor above
processing_step = ProcessingStep(
    name="processing",
    processor=processor,
    code="preprocessor.py",
)

pipeline = Pipeline(name="foo", steps=[processing_step])
pipeline.upsert(role_arn=...)  # create the pipeline, or update it if it exists
pipeline.start()

`pipeline.definition()` produces rather verbose JSON like this:

{
    "Version": "2020-12-01",
    "Metadata": {},
    "Parameters": [],
    "PipelineExperimentConfig": {
        "ExperimentName": {
            "Get": "Execution.PipelineName"
        },
        "TrialName": {
            "Get": "Execution.PipelineExecutionId"
        }
    },
    "Steps": [
        {
            "Name": "processing",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.large",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
                    "ContainerEntrypoint": [
                        "python3",
                        "/opt/ml/processing/input/code/preprocessor.py"
                    ]
                },
                "RoleArn": "arn:aws:iam::123456789012:role/foo",
                "ProcessingInputs": [
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://bucket/preprocessor.py",
                            "LocalPath": "/opt/ml/processing/input/code",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ]
            }
        }
    ]
}

You could use the above JSON with CloudFormation/CDK, but you'd still build the JSON with the SageMaker SDK.
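
For instance, a sketch of feeding that JSON into the CDK's low-level CfnPipeline construct (assuming CDK v2 for Python, with `definition_json` holding the output of `pipeline.definition()` and `self` being the surrounding Stack):

from aws_cdk import aws_sagemaker as sagemaker

# Low-level (L1) construct mapping 1:1 to the AWS::SageMaker::Pipeline resource
sagemaker.CfnPipeline(
    self, "Pipeline",
    pipeline_name="foo",
    role_arn="arn:aws:iam::123456789012:role/foo",
    pipeline_definition={"PipelineDefinitionBody": definition_json},
)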

You can also define model-building workflows as Step Functions state machines (using the AWS Step Functions Data Science SDK) or with Airflow.
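
As a rough, hypothetical sketch of the Step Functions route (`estimator` is a placeholder for any SageMaker Estimator, and the role ARN and names are made up):

from stepfunctions.steps import Chain, TrainingStep
from stepfunctions.workflow import Workflow

# A single training step; more steps can be chained the same way
train_step = TrainingStep(
    "train",
    estimator=estimator,
    data={"train": "s3://bucket/train"},
    job_name="my-training-job",
)

workflow = Workflow(
    name="ml-workflow",
    definition=Chain([train_step]),
    role="state-machine-execution-role-arn",
)
workflow.create()   # provisions the Step Functions state machine
workflow.execute()  # starts an execution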

Neil McGuigan
  • Thanks Neil, great explanation. What still confuses me is that I don't see any properties like `Domain/Studio Id` exposed on the CDK/CF resources. To create a pipeline, I suppose it requires me to have Studio up and running? – Bruno Schaatsbergen Dec 02 '21 at 07:50
  • @BrunoSchaatsbergen, no, you don't need Studio to create a SageMaker Pipelines DAG. Studio is just an optional UI; you can do everything with code too. – Olivier Cruchant Dec 02 '21 at 08:45
  • And there is a connection between SageMaker Pipelines and CodePipeline: the initial SageMaker Pipelines Projects (a CloudFormation DevOps template) used CodePipeline as an orchestrator to trigger SageMaker Pipelines executions, as illustrated here: https://aws.amazon.com/fr/blogs/machine-learning/building-automating-managing-and-scaling-ml-workflows-using-amazon-sagemaker-pipelines/ – Olivier Cruchant Dec 02 '21 at 08:46
  • I'm still not sure what the difference is between SageMaker Pipelines and CDK Pipelines. Isn't SageMaker Pipelines supposed to be the CI/CD solution for ML? What would the CDK offer on top of this? – Cybernetic Mar 22 '22 at 21:18
  • @Cybernetic the only similarity is that they both have the word "pipeline" in them – Neil McGuigan Mar 22 '22 at 22:33