Which one to use to process data for a SageMaker batch inference pipeline: the SKLearn estimator or SKLearnProcessor?

Question

I'm building a SageMaker batch inference pipeline and am confused about which option to use for processing features (before inference): sagemaker.sklearn.processing.SKLearnProcessor or sagemaker.sklearn.estimator.SKLearn. My understanding of the two options is as follows:

There are AWS docs that use sagemaker.sklearn.estimator.SKLearn with batch transform to process the data. The advantage of this class and its .create_model() method is that I can include the created model (which processes the features before inference) in a sagemaker.pipeline.PipelineModel deployed on an endpoint, so the whole pipeline sits behind a single endpoint that is called when an inference request comes in. This is detailed in: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html I don't know the specific cons, and that's my first question (1).

However, if it's only for data processing, I can also use sagemaker.sklearn.processing.SKLearnProcessor to create SageMaker Processing jobs that process the features and write them to S3 for the model to run batch inference on. The advantage, to me, is that it makes more sense to have a job designed for processing; the disadvantage is that I seem to have to write a handler myself to chain the processing and the inference, unlike with sagemaker.sklearn.estimator.SKLearn. https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html So my next question (2) is: is there a way to include SKLearnProcessor in a sagemaker.pipeline.PipelineModel? If not, the follow-up question (3) is: if SKLearnProcessor is not designed for use in inference, what is its use case?

The final question (4) is: from an efficiency perspective, what are the pros and cons of each method in a SageMaker batch inference pipeline?

score 1 · Accepted Answer · answered Oct 03 '22 at 16:44

SageMaker Inference Pipeline is a functionality of SageMaker hosting whereby you can create a serial inference pipeline (chain of containers) on an endpoint and/or Batch Transform Job.

With regard to the link you shared, a common pattern is to use two containers: the first hosts the Scikit-learn model, which acts as the pre-processing step, and passes the request on to the second container, which hosts the predictive model. The pair can run either on an endpoint or in a Batch Transform Job.
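To make the first container's role concrete, here is a minimal sketch of what its entry-point script could look like. The function names model_fn, input_fn, predict_fn and output_fn are the hooks the SageMaker Scikit-learn serving container looks for. Everything else is an assumption for illustration: the scaler.json file and its mean/std layout are hypothetical, and a real script would typically load a fitted scikit-learn transformer (for example with joblib) instead. Returning a plain CSV string from output_fn is also a simplification.

```python
import csv
import io
import json
import os


def model_fn(model_dir):
    """Load preprocessing parameters from the model artifact directory.

    "scaler.json" is a hypothetical file name used only for this sketch.
    """
    with open(os.path.join(model_dir, "scaler.json")) as f:
        return json.load(f)


def input_fn(request_body, content_type):
    """Parse an incoming CSV payload into a list of float rows."""
    if content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {content_type}")
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    reader = csv.reader(io.StringIO(request_body))
    return [[float(value) for value in row] for row in reader if row]


def predict_fn(rows, scaler):
    """Standardize each column; this is the "prediction" of a preprocessing model."""
    means, stds = scaler["mean"], scaler["std"]
    return [
        [(value - mean) / (std or 1.0) for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


def output_fn(prediction, accept):
    """Serialize the transformed rows as CSV for the next container."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(prediction)
    return buffer.getvalue()
```

The output of output_fn is what the second container receives as its request body, so its format has to match what the downstream model expects.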

The SKLearnProcessor is used to kick off an SKLearn Processing Job. You can use the SKLearnProcessor with a processing script to process your data. Because it launches a job rather than producing a model that a container can host, SKLearnProcessor cannot be used in a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel).
As stated above, SKLearnProcessor is designed to kick off a SageMaker Processing Job that makes use of the Scikit-learn container and can be used for data pre- or post-processing and model evaluation workloads. Kindly see this link for more information.
Are you trying to decide whether to process your data with SKLearnProcessor (a Processing Job) or to use a PipelineModel that contains a preprocessing step in a Batch Transform Job?

If so, the decision depends on your use case. If you use a Processing Job (SKLearnProcessor), the job needs to be kicked off before the Batch Transform Job. Once the Processing Job has completed, you can kick off the Batch Transform Job with the output of the Processing Job as its input.
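As a rough sketch of that two-job sequence with the SageMaker Python SDK (v2), assuming a hypothetical processing script named preprocess.py, an already-created SageMaker model referenced by model_name, and illustrative instance types and framework version:

```python
def run_processing_then_batch_transform(
    role, raw_s3_uri, features_s3_uri, predictions_s3_uri, model_name
):
    """Run a Processing Job, then a Batch Transform Job on its output.

    All arguments are placeholders: an IAM role ARN, S3 URIs, and the name
    of a model that already exists in SageMaker.
    """
    # Imported lazily so the sketch can be read and loaded without the SDK.
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.transformer import Transformer

    # Step 1: feature processing in the Scikit-learn processing container.
    processor = SKLearnProcessor(
        framework_version="1.0-1",  # assumed version; use one available to you
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )
    processor.run(
        code="preprocess.py",  # hypothetical script name
        inputs=[
            ProcessingInput(
                source=raw_s3_uri, destination="/opt/ml/processing/input"
            )
        ],
        outputs=[
            ProcessingOutput(
                source="/opt/ml/processing/output", destination=features_s3_uri
            )
        ],
        wait=True,  # block until the processed features are in S3
    )

    # Step 2: batch inference on the processed features.
    transformer = Transformer(
        model_name=model_name,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=predictions_s3_uri,
    )
    transformer.transform(
        data=features_s3_uri,
        content_type="text/csv",
        split_type="Line",
        wait=True,
    )
    return transformer
```

The ordering is handled here by blocking on each job in turn; in practice the same sequencing could be delegated to an orchestrator instead of a single script.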

On the other hand, if you use a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel), you just need to make sure that the first container preprocesses each request so that it complies with what the model expects. With this option the processing is done on a per-request basis within the Batch Transform Job itself.
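A sketch of this second option with the SageMaker Python SDK (v2) could look as follows. The entry point name inference.py, the pipeline name, the framework version and the instance types are assumptions for illustration; the preprocessing model is built here from an existing model artifact with SKLearnModel, although calling .create_model() on a fitted SKLearn estimator, as in the linked example, serves the same purpose.

```python
def run_pipeline_batch_transform(
    role,
    preprocessor_model_data,
    predictor_image_uri,
    predictor_model_data,
    raw_s3_uri,
    predictions_s3_uri,
):
    """Chain a preprocessing container and a model container in one Batch Transform Job.

    All arguments are placeholders: an IAM role ARN, S3 URIs of model
    artifacts and data, and the container image URI of the predictive model.
    """
    # Imported lazily so the sketch can be read and loaded without the SDK.
    from sagemaker.model import Model
    from sagemaker.pipeline import PipelineModel
    from sagemaker.sklearn.model import SKLearnModel

    # First container: Scikit-learn preprocessing served from an inference script.
    preprocessor = SKLearnModel(
        model_data=preprocessor_model_data,
        role=role,
        entry_point="inference.py",  # hypothetical script name
        framework_version="1.0-1",  # assumed version; use one available to you
    )

    # Second container: the predictive model.
    predictor = Model(
        image_uri=predictor_image_uri,
        model_data=predictor_model_data,
        role=role,
    )

    # Containers are invoked in list order for every request.
    pipeline_model = PipelineModel(
        name="preprocess-then-predict",  # hypothetical name
        role=role,
        models=[preprocessor, predictor],
    )

    transformer = pipeline_model.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=predictions_s3_uri,
    )
    # Raw (unprocessed) data goes in; preprocessing happens inside the job.
    transformer.transform(
        data=raw_s3_uri,
        content_type="text/csv",
        split_type="Line",
        wait=True,
    )
    return transformer
```

The same PipelineModel object can also be deployed to a real-time endpoint, which is why this option keeps batch and online preprocessing logic in one place.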

crystal clear explanation. Thanks. Though I switched my inference strategy to realtime by reading data from online feature store, I choose the second option SKLearn to utilise sagemaker.pipeline.PipelineModel with two containers for processing and inferencing as suggested. — SKSKSKSK, Nov 10 '22 at 01:09
