I’m reading Vowpal Wabbit’s documentation to understand how it actually learns. A traditional contextual bandit learns a function F(context, action) = reward, finds the action that maximizes the predicted reward, and returns that action as the recommendation. F can be any model (linear, neural net, XGBoost, etc.) and is learned through batch processing, i.e. collect 100 contexts, 100 actions, and 100 rewards, train the ML model, then repeat.
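To make the batch setup I have in mind concrete, here is a minimal sketch using only NumPy. The ridge-regularized linear model, the helper names (`featurize`, `fit_reward_model`, `recommend`), and the simulated data are all my own assumptions for illustration; none of this is VW code.

```python
import numpy as np

def featurize(context, action, n_actions):
    """Joint features: the context copied into the block belonging to `action`."""
    x = np.zeros(n_actions * context.size)
    x[action * context.size:(action + 1) * context.size] = context
    return x

def fit_reward_model(contexts, actions, rewards, n_actions, ridge=1e-3):
    """Batch step: ridge least-squares fit of F(context, action) ~ reward."""
    X = np.stack([featurize(c, a, n_actions) for c, a in zip(contexts, actions)])
    y = np.asarray(rewards, dtype=float)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def recommend(weights, context, n_actions):
    """Return the action with the highest predicted reward."""
    scores = [featurize(context, a, n_actions) @ weights for a in range(n_actions)]
    return int(np.argmax(scores))

# Simulated batch: 100 contexts, 100 logged actions, 100 observed rewards.
rng = np.random.default_rng(0)
n_actions, dim = 3, 4
true_w = rng.normal(size=(n_actions, dim))  # hidden parameters, simulation only
contexts = rng.normal(size=(100, dim))
actions = rng.integers(0, n_actions, size=100)
rewards = np.array([true_w[a] @ c for c, a in zip(contexts, actions)])

weights = fit_reward_model(contexts, actions, rewards, n_actions)
action = recommend(weights, contexts[0], n_actions)
```

In a real loop the fit would be repeated on each new batch, and F could be swapped for any other regressor.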
Now, VW’s documentation says it reduces “all contextual bandit problems to cost-sensitive multiclass classification problems.” OK, I read up on that, but there still needs to be some function F that is trained to minimize the cost, doesn’t there?
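For reference, this is how I currently picture the reduction: each logged (action, cost, probability) triple becomes a full cost vector via inverse propensity scoring. The function name and the choice of IPS are my assumptions about the general technique, not something I found in VW’s source.

```python
import numpy as np

def ips_cost_vector(chosen_action, observed_cost, probability, n_actions):
    """Estimate a cost for every action from one bandit observation.

    Only the chosen action's cost was observed, so it is reweighted by
    1 / probability; the unobserved actions get an estimate of zero.
    """
    costs = np.zeros(n_actions)
    costs[chosen_action] = observed_cost / probability
    return costs

# Action 1 was played with probability 0.5 and incurred a cost of 2.0.
example = ips_cost_vector(chosen_action=1, observed_cost=2.0, probability=0.5, n_actions=3)
```

Even with a cost vector like this, something still has to map a context to a predicted cost per action, and that is the F I can’t find.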
I’ve read the documentation thoroughly, and either I:
- missed what the default learner is for batch processing, or
- don’t understand how VW actually learns in this cost-sensitive framework.
I’ve even scoured the vw.learn() method inside pyvwlib. Thanks for the help!