I am using SageMaker to train and deploy my machine learning model. As for prediction, it will be executed by a Lambda function as a scheduled job (every hour). The process is as follows (a rough sketch in code follows the list):
- pull new data from S3 since the last prediction
- preprocess, aggregate, and create the prediction data set
- call the SageMaker endpoint and make the prediction
- either save the result to S3 or insert it into a database table
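In skeleton form, the scheduled job would look something like this (the bucket, key, and endpoint names are placeholders I made up, and the preprocessing step is stubbed out):

    import boto3

    s3 = boto3.client('s3')
    runtime = boto3.client('sagemaker-runtime')

    BUCKET = 'pred_data'           # placeholder bucket name
    ENDPOINT_NAME = 'my-endpoint'  # placeholder endpoint name

    def lambda_handler(event, context):
        # 1. pull new data from S3 since the last prediction
        obj = s3.get_object(Bucket=BUCKET, Key='raw/foo.csv')
        raw = obj['Body'].read().decode('utf-8')

        # 2. preprocess/aggregate into a CSV payload (real logic goes here)
        payload = raw

        # 3. call the SageMaker endpoint
        response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                           ContentType='text/csv',
                                           Body=payload)
        result = response['Body'].read().decode('utf-8')

        # 4. save the result to S3 (or insert into a database instead)
        s3.put_object(Bucket=BUCKET, Key='predictions/foo.csv',
                      Body=result.encode('utf-8'))
        return {'statusCode': 200}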
Based on what I have found, typically the input will either come from the Lambda payload:

    import json
    import boto3

    runtime = boto3.client('sagemaker-runtime')

    # inside the handler, event is the dict Lambda passes in
    data = json.loads(json.dumps(event))  # round-trip; effectively a no-op on a plain dict
    payload = data['data']
    print(payload)
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='text/csv',
                                       Body=payload)
or be read from an S3 file:

    import csv
    import io
    import boto3

    client = boto3.client('s3')
    runtime = boto3.client('sagemaker-runtime')

    bucket = 'pred_data'  # substitute your S3 bucket name
    obj = client.get_object(Bucket=bucket, Key='foo.csv')  # get_object takes the name, not a Bucket resource
    body = obj['Body'].read().decode('utf-8')
    lines = body.splitlines()
    reader = csv.reader(lines)
    file = io.StringIO(body)  # StringIO takes the raw string, not the list of lines
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT,
                                       ContentType='*/*',
                                       Body=file.getvalue())
    output = response['Body'].read().decode('utf-8')
Since I will be pulling raw data from S3 and preprocessing it, a pandas DataFrame will be generated. Is it possible to feed this directly as the input to invoke_endpoint? I could upload the aggregated dataset to another S3 bucket, but does it have to go through the decoding, csv.reader, StringIO and all that, just like the example I found, or is there an easier way to do it? Is the decode step really necessary to get the output?
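Ideally I would do something like the following; this is just a sketch of what I have in mind, with df standing in for my preprocessed DataFrame and the endpoint name made up:

    import boto3
    import pandas as pd

    runtime = boto3.client('sagemaker-runtime')

    df = pd.DataFrame([[1.0, 2.0, 3.0]])  # stand-in for the aggregated data set

    # serialize the DataFrame straight to CSV text, skipping the S3 round trip
    payload = df.to_csv(index=False, header=False)

    response = runtime.invoke_endpoint(EndpointName='my-endpoint',  # made-up name
                                       ContentType='text/csv',
                                       Body=payload)
    # Body is a StreamingBody: .read() returns bytes, hence the decode to get a str
    output = response['Body'].read().decode('utf-8')

If to_csv() straight into Body works like this, the csv.reader/StringIO steps in the example would seem unnecessary.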