I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
1. Is using sagemaker equivalent to running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is; if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer compared to using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
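To make the comparison concrete, this is roughly what I mean by "using sagemaker" directly. It is only a minimal sketch assuming the sagemaker SDK v2, the built-in XGBoost container, and hypothetical S3 paths:

import sagemaker
from sagemaker import get_execution_role, image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Built-in XGBoost container for the current region (sagemaker SDK v2 naming).
image_uri = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=get_execution_role(),
    instance_count=2,                          # >1 instance => the managed training job is distributed
    instance_type="ml.m4.xlarge",
    output_path="s3://my-bucket/xgb-output/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=1)

# Hypothetical S3 prefix with training data in CSV format.
estimator.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")})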
2. Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark on a small set of test data? I see that what it does under the hood is train the model and also create an endpoint to serve its predictions, and I assume that this endpoint is deployed on an EC2 instance that is created and started at that moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
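For context, with the plain sagemaker library it seems possible to attach a predictor to an endpoint that already exists instead of deploying a new one; a minimal sketch assuming sagemaker SDK v2 and a hypothetical endpoint name (I haven't found an equivalent pattern in sagemaker_pyspark):

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

# Attach to an endpoint that is already running (hypothetical name).
predictor = Predictor(
    endpoint_name="my-existing-xgboost-endpoint",
    serializer=CSVSerializer(),
)

# Reuse the same endpoint for new predictions without waiting for a new deployment.
print(predictor.predict("1.0,2.0,3.0"))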
3. Does sagemaker_pyspark support custom algorithms, or does it only allow you to use the predefined ones in the library?
4. Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
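This is the kind of thing I mean: a minimal sketch of HyperparameterTuner from the plain sagemaker library, assuming an already-configured built-in XGBoost Estimator and hypothetical S3 channels, which I can't reproduce with sagemaker_pyspark:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# 'estimator' is assumed to be an already-configured built-in XGBoost Estimator.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.1, 0.5),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# Hypothetical S3 locations for the train/validation channels.
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})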
5. I am a bit confused about the concepts of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? Why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the AlgorithmEstimator class allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?
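To illustrate what I mean by "three different ways", here is a minimal sketch assuming the sagemaker SDK v2 (where image_name has become image_uri); the container URI, entry-point script, and algorithm ARN are hypothetical placeholders:

from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.algorithm import AlgorithmEstimator

role = get_execution_role()

# 1) Bring your own container: the training logic lives inside the image.
byoc = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-training-image:latest",  # hypothetical
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
)

# 2) Framework estimator + entry_point: the model is defined in an external script,
#    which runs inside a prebuilt framework container.
script_mode = SKLearn(
    entry_point="train.py",          # hypothetical local script
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="0.23-1",
)

# 3) Packaged / Marketplace algorithm referenced by its ARN.
packaged = AlgorithmEstimator(
    algorithm_arn="arn:aws:sagemaker:eu-west-1:123456789012:algorithm/my-algorithm",  # hypothetical
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
)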
6. I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.
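As a rough idea of the orchestration layer I have in mind, here is a minimal sketch of an Amazon States Language definition built in Python and registered with boto3; the job name, image URI, role ARNs and bucket are hypothetical placeholders:

import json
import boto3

# Hypothetical state machine: a Glue ETL job followed by a SageMaker training job.
definition = {
    "StartAt": "PreprocessWithGlue",
    "States": {
        "PreprocessWithGlue": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-etl-job"},  # hypothetical Glue job
            "Next": "TrainModel",
        },
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$$.Execution.Name",
                "AlgorithmSpecification": {
                    "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-training-image:latest",
                    "TrainingInputMode": "File",
                },
                "RoleArn": "arn:aws:iam::123456789012:role/MySageMakerRole",
                "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
                "ResourceConfig": {
                    "InstanceCount": 1,
                    "InstanceType": "ml.m4.xlarge",
                    "VolumeSizeInGB": 30,
                },
                "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
            },
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="ml-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/MyStepFunctionsRole",  # hypothetical
)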
7. I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?
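For reference, this is how I understand SparkMLModel is meant to be used: a minimal sketch assuming a Spark ML pipeline that has already been serialized with MLeap and uploaded to a hypothetical S3 path:

from sagemaker import get_execution_role
from sagemaker.sparkml.model import SparkMLModel

# The model artifact is a Spark ML PipelineModel serialized with MLeap (hypothetical path).
sparkml_model = SparkMLModel(
    model_data="s3://my-bucket/mleap/model.tar.gz",
    role=get_execution_role(),
)

# Host the serialized pipeline behind a real-time endpoint.
predictor = sparkml_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)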