
I am working on implementing a contextual bandit with Vowpal Wabbit for dynamic pricing, where arms represent price margins. The cost/reward is determined as price minus expected cost. The cost is not known initially, so it is a prediction and may change. My question is: if your cost/reward can change over time, can you update it to reflect the realized cost and retrain the model?

Below is an example with a training set with 1 feature (user) and a test set. The cost was based on the expected net revenue. The model is trained and used to predict which action to take for the customers in the test set.

import pandas as pd
from vowpalwabbit import pyvw

train_data = [{'action': 1, 'cost': -150, 'probability': 0.4, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -250, 'probability': 0.5, 'user': 'c'},
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]

train_df = pd.DataFrame(train_data)

# Add index to data frame
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

# Test data
test_data = [{'user': 'b'},
            {'user': 'a'},
            {'user': 'b'},
            {'user': 'c'}]

test_df = pd.DataFrame(test_data)

# Add index to data frame
test_df['index'] = range(1, len(test_df) + 1)
test_df = test_df.set_index("index")

# Create python model and learn from each trained example
vw = pyvw.vw("--cb 4")

for i in train_df.index:
  action = train_df.loc[i, "action"]
  cost = train_df.loc[i, "cost"]
  probability = train_df.loc[i, "probability"]
  user = train_df.loc[i, "user"]

  # Construct the example in the required vw format.
  learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | " + str(user) 

  # Here we do the actual learning.
  vw.learn(learn_example)
  
# Predict actions
for j in test_df.index:
  user = test_df.loc[j, "user"]

  test_example = "| " + str(user)

  # For --cb, predict returns the chosen action as a 1-based integer index.
  choice = vw.predict(test_example)
  print(j, choice)  

However, after a week we received new information: the realized cost was higher than expected for the first training example and lower than expected for the third (see the comments in the updated data below). Can this new information be used to retrain the model and predict actions?

## Reward/cost changed after 1 week once cost was realized
train_data = [{'action': 1, 'cost': 200, 'probability': 0.4, 'user': 'a'}, # Lost money
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -350, 'probability': 0.5, 'user': 'c'}, # Made more than exp.
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]
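
What I have in mind is simply re-running the same learning loop over the corrected data, something like the sketch below (re-using the vw object and loop from above; I am not sure whether updating the existing model or retraining from scratch is the right approach):

# Rebuild the data frame from the corrected costs and feed the examples
# to the model again, exactly as in the original training loop.
train_df = pd.DataFrame(train_data)
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

for i in train_df.index:
  action = train_df.loc[i, "action"]
  cost = train_df.loc[i, "cost"]
  probability = train_df.loc[i, "probability"]
  user = train_df.loc[i, "user"]
  learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | " + str(user)
  vw.learn(learn_example)
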
aab
1 Answer


Yes, I don't see why changing the reward over time would be a problem. This is certainly how the real world works too: actions may become more or less appropriate in a changing world. Contextual bandits work well in a non-stationary environment, so it should be fine.

One thing to note, though, is that if your environment is non-stationary you probably want to set the --power_t option to 0. By default, VW's learning rate decays over time (t), because if your problem were stationary you would want to converge on a solution.
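
A minimal sketch of what that could look like with the question's setup (assuming the same pyvw import and data as above; the corrected label string re-uses the first example's realized cost):

# Constant learning rate: --power_t 0 turns off the default decay over time.
vw = pyvw.vw("--cb 4 --power_t 0")

# Feed the example again once the realized cost is known
# (action 1, realized cost 200, logged probability 0.4, user 'a').
vw.learn("1:200:0.4 | a")

# Predict exactly as before; for --cb this returns a 1-based action index.
print(vw.predict("| a"))
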

jackgerrits
  • Thanks for the response! Would that also apply if I am using --cb_explore_adf? I was using --cb 4 in this example for simplicity. Also, is there somewhere that lists what the default hyperparameters are set to? I am having a hard time finding that information. – aab Jan 06 '22 at 19:56
  • Yep, same for ADF. You can look [here](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/VW-arguments-JSON-format) for default option values. Some options aren't written in such a way that their default shows up in that list, at which point it means looking through the code, which is not great, sorry about that. I'll do a pass now to see what I can get exposed in that wiki page. – jackgerrits Jan 06 '22 at 20:00