I'm trying to create a SageMaker hosted endpoint for a contextual bandit, using the RLEstimator class and the Vowpal Wabbit image.

I'm following this example: Reinforcement Learning with SageMaker

My code for creating the training job works fine:


from sagemaker.rl import RLEstimator

# Vowpal Wabbit container image available in eu-west-1
vw_image_uri = "462105765813.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-rl-vw-container:vw-8.7.0-cpu"

hyperparameters = {
    "exploration_policy": "egreedy",  # supports "egreedy", "bag", "cover"
    "epsilon": 0.3,  # used if egreedy is the exploration policy
    "num_policies": 3,  # used if bag or cover is the exploration policy
    "num_arms": 6,
}

rl_estimator = RLEstimator(
    entry_point='train.py',
    source_dir='src',
    image_uri=vw_image_uri,
    role=role,
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    base_job_name=job_name_prefix,
    instance_type=instance_type,
    instance_count=1,
    hyperparameters=hyperparameters
)

But when I run the following to deploy the model:

# Create endpoint
endpoint_name = "nba-vw-test"
bandit_model.deploy(
    initial_instance_count=1, 
    instance_type=instance_type, 
    endpoint_name=endpoint_name
)

the deployment fails. In the CloudWatch logs I see:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/vw_serving/vw_model.py", line 81, in start
    self.predict([])
  File "/usr/local/lib/python3.6/dist-packages/vw_serving/vw_model.py", line 131, in predict
    scores = np.array(list(map(float, self.current_proc.stdout.readline().split())))

This is odd, because I don't have a vw_model.py file. I'm assuming it's taken from here: VW Serving code

These extra files seem to be created and used automatically when that endpoint is created, but they conflict with my model since it doesn't predict in that format. I'm not sure how to get around this.
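To make the suspected mismatch concrete, here is a minimal sketch of what the parsing on line 131 of vw_model.py appears to expect: a single line of whitespace-separated floats. The function name `parse_scores` and the sample lines are made up for illustration, the NumPy wrapper is left out, and I'm assuming the line arrives as text. The traceback above is cut off before the exception, so this only shows how that expression behaves, not what actually went wrong.

```python
def parse_scores(line: str) -> list:
    """Mimic list(map(float, <stdout line>.split())) from the traceback.

    Hypothetical helper: the real code wraps the result in np.array and
    reads the line from the VW subprocess's stdout.
    """
    return list(map(float, line.split()))


# A line of whitespace-separated floats parses cleanly.
print(parse_scores("0.1 0.2 0.7\n"))

# An empty line (e.g. the process wrote nothing) yields an empty list.
print(parse_scores(""))

# A made-up line in any other shape raises ValueError.
try:
    parse_scores("2:0.5,1:0.3\n")
except ValueError as err:
    print("parse failed:", err)
```

If the serving code and the trained model disagree on the output format, either the empty case or the ValueError case would be a plausible symptom.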

My source directory, src/, contains:

  • train
  • env
  • vw_agent
  • io_utils
  • vw_utils
  • namespace

This is similar to the walkthrough linked above. Has anyone dealt with this?

To summarise, I want to modify the input and output data and response of the endpoint.

I followed the walkthrough and read through the docs, but found limited information on this issue.

  • You stated that Training completed, are you able to load up the model locally before deploying to a SageMaker endpoint? – Marc Karp May 16 '23 at 03:21
  • I haven't tried that but the model is stored in s3 so assume so. Is there an extra step that I'm missing? – Cris Pineda May 16 '23 at 08:24
  • Found the issue. The walkthrough uses a different image: `462105765813.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rl-vw-container:adf` compared to what is available for my region: `462105765813.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-rl-vw-container:vw-8.7.0-cpu`. Not sure where to find the `adf` one now. – Cris Pineda May 16 '23 at 10:36
