8

I am calling a SageMaker endpoint using the Java SageMaker SDK. The data that I am sending needs a little cleaning before the model can use it for prediction. How can I do that in SageMaker?

I have a pre-processing function in the Jupyter notebook instance that cleans the training data before passing it to the model for training. Now I want to know whether I can use that function while calling the endpoint, or whether it is already being used. I can show my code if anyone wants.

EDIT 1: Basically, in the pre-processing I am doing label encoding. Here is my function for preprocessing:

from sklearn import preprocessing


def preprocess_data(data):
    print("entering preprocess fn")
    # Convert documentId & documentType columns to integer labels
    le1 = preprocessing.LabelEncoder()
    le1.fit(data["documentId"])
    data["documentId"] = le1.transform(data["documentId"])
    le2 = preprocessing.LabelEncoder()
    le2.fit(data["documentType"])
    data["documentType"] = le2.transform(data["documentType"])
    print("exiting preprocess fn")
    return data, le1, le2

Here 'data' is a pandas DataFrame.

Now I want to use these le1 and le2 at the time of calling the endpoint. I want to do this preprocessing in SageMaker itself, not in my Java code.

gashu

4 Answers

4

There is now a new feature in SageMaker called inference pipelines. It lets you build a linear sequence of two to five containers that pre- and post-process requests. The whole pipeline is then deployed on a single endpoint.

https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
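For example (a minimal sketch, not code from the docs above), a pipeline model can be created with boto3 by listing the containers in the order they should run. The model name, role ARN, image URIs and S3 paths below are placeholders.

import boto3

sm = boto3.client("sagemaker")

# A pipeline model is an ordinary SageMaker model whose "Containers" list
# holds several containers executed in sequence: pre-processing first,
# then the algorithm container.
sm.create_model(
    ModelName="my-inference-pipeline",  # placeholder name
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    Containers=[
        {
            "Image": "<preprocessing-container-image-uri>",
            "ModelDataUrl": "s3://my-bucket/preprocessor/model.tar.gz",
        },
        {
            "Image": "<algorithm-container-image-uri>",
            "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
        },
    ],
)

The resulting model is then attached to an endpoint configuration (or a batch transform job) just like a single-container model.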

Julien Simon
  • I had a read but have a couple of questions... does it cater to the case where you need batch processing IN ADDITION TO live request handling? – Prathamesh dhanawade Jul 22 '19 at 15:01
  • You can use Inference Pipelines for real-time endpoints and batch transforms, but not at the same time :) A pipeline is either deployed to an endpoint, or to a transformer, you cannot mix. – Julien Simon Jul 22 '19 at 15:09
  • So what if I like real-time but would like to do batch sometimes? I thought we could achieve that switch capability using Inference Pipelines?! (maybe I was wrong) – Prathamesh dhanawade Jul 22 '19 at 15:58
  • 1
    You can use the same pipeline (i.e. the same sequence of containers), but you have to specifically deploy to an endpoint, or to batch transform. So if you already have an endpoint running, you'd have to run an additional batch transform job. – Julien Simon Jul 22 '19 at 18:53
  • Yeah, that makes more sense now. Also, considering the batch transform cost would vary with the time taken by the job to run... smaller batches wouldn't be an issue. – Prathamesh dhanawade Jul 23 '19 at 19:06
2

You need to write a script and supply it while creating your model. That script should have an input_fn where you can do your preprocessing. Please refer to the AWS docs for more details.

https://docs.aws.amazon.com/sagemaker/latest/dg/mxnet-training-inference-code-template.html
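As a rough illustration (not code taken from the linked docs), an inference script for a framework container typically provides hooks such as model_fn, input_fn and predict_fn. The exact hook names and signatures depend on the container version, and the file names and saved encoders below are assumptions.

import json
import os

import joblib  # assumes the fitted LabelEncoders were saved with joblib at training time
import pandas as pd


def model_fn(model_dir):
    # Load whatever the training job saved into model_dir; these file names
    # are placeholders for the sketch.
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    le1 = joblib.load(os.path.join(model_dir, "le_documentId.joblib"))
    le2 = joblib.load(os.path.join(model_dir, "le_documentType.joblib"))
    return model, le1, le2


def input_fn(request_body, request_content_type):
    # Deserialize the request into a pandas DataFrame.
    if request_content_type == "application/json":
        return pd.DataFrame(json.loads(request_body))
    raise ValueError("Unsupported content type: " + request_content_type)


def predict_fn(input_data, artifacts):
    # Apply the same label encoding used at training time, then predict.
    model, le1, le2 = artifacts
    input_data["documentId"] = le1.transform(input_data["documentId"])
    input_data["documentType"] = le2.transform(input_data["documentType"])
    return model.predict(input_data)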

Raman
  • Thanks @Raman. I am trying to implement this. Right now I am unable to use the pandas library in the script. The script is getting executed in the MXNet environment, so I am getting this error: ImportError: No module named 'pandas'. Do you know how we can use external libraries in the script? – gashu Apr 05 '18 at 07:55
  • Check out this response: [How do I load python modules which are not available in Sagemaker?](https://stackoverflow.com/questions/49665241/how-do-i-load-python-modules-which-are-not-available-in-sagemaker/49676109#49676109) – Raman Apr 09 '18 at 18:53
  • Is this only possible when using Apache MXNet in Sagemaker? – Sip Oct 08 '18 at 08:17
2

One option is to put your pre-processing code in an AWS Lambda function and have that Lambda call SageMaker's invoke-endpoint once the pre-processing is done. AWS Lambda supports Python, so it should be easy to reuse the code you already have in your Jupyter notebook within that Lambda function. You can also use the Lambda to call external services such as DynamoDB for lookups to enrich the data.

You can find more information in the SageMaker documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/getting-started-client-app.html
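A minimal sketch of such a Lambda handler in Python, using the SageMaker runtime client; the endpoint name, event format and preprocess() helper are illustrative assumptions:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")


def preprocess(record):
    # Placeholder for the same cleaning/label encoding applied to the
    # training data (e.g. mapping documentId / documentType to integer labels).
    return [record["documentId"], record["documentType"]]


def lambda_handler(event, context):
    features = preprocess(event)

    # Call the SageMaker endpoint with the cleaned record.
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",  # placeholder endpoint name
        ContentType="text/csv",
        Body=",".join(str(v) for v in features),
    )
    return json.loads(response["Body"].read())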

Guy
  • Sorry for the late response; I have updated my question. Basically, I have to use the same preprocessing function while calling the endpoint, as I have to use the label encoders. – gashu Apr 05 '18 at 07:50
0

The SageMaker MXNet container is open source.

You can add pandas to the Docker container here: https://github.com/aws/sagemaker-mxnet-containers/blob/master/docker/1.1.0/Dockerfile.gpu#L4

The repo has instructions on how to build the container as well: https://github.com/aws/sagemaker-mxnet-containers#building-your-image