VW contextual bandits: historical data and online learning

Question

I'd like to test CB for e-commerce task: personal offer recommendations (like "last chance to buy", "similar positions", "consumers recommend", "bestsellers", etc.). My task is to order them (more relevant issue is higher in the list of recommendations).

So, there are 5 possible offers. I have some historical data collected without using any model: context (user and web-session features), action id (one of my 5 offers), reward (1 if user clicked this offer, 0 - not clicked). So I have N users and 5 offers with known reward, totally 5*N rows in my historical data.

Ex:

1:1:1 | user_id:1 f1:... f2:... 2:-1:1 | user_id:1 f1:... f2:... 3:-1:1 | user_id:1 f1:... f2:...

This means that user 1 have seen 3 offers (1,2,3), cost of the 1 offer is equal to 1 (didn't click), user ckickes on offers 2 and 3 (cost is negative -> reward is positive). Probabilities are equal to 1, since all offers were shown and we know rewards.

Global task is to increase CTR. I'd like to use this data for training CB and then improve the model by exploration/exploitation policies. I set probabilities equal to 1 in this data (Is it right?). But next I'd like to set the order of offers according to rewards.

Should I use for this warm start in VW CB? Will this work correctly with data collected without using CB? Maybe you can advise more relevant methods in CB for this data and task?

Thanks a lot.

Paul Mineiro · Answer 1 · 2020-06-16T14:56:33.267

If there are only 5 possible offers and if you (as indicated) have data of the form "I have N users and 5 offers with known reward, totally 5*N rows in my historical data." then your historical data is supervised multilabel data and the warm-start functionality would apply; make sure you use the cost-sensitive version to accommodate the multilabel aspect of your historical data (i.e., there is more than one offer that would result in a click).

Will this work correctly with data collected without using CB?

Because the every action-reward is specified for every user in the data set, you only have to ensure that the sample of users is representative of the population you care about.

Maybe you can advise more relevant methods in CB for this data and task?

The first paragraph started with "if" because the more typical case is 1) there are many possible offers and 2) users have only seen a few of them historically.

In such case what you have is a combination of a degenerate logging policy and multiple rewards being revealed. If there are k possible actions but each user has only seen n<=k historically then you could try and make n lines for each user as you did. Theoretically this does not necessarily work but in practice it might help.

Out of the box: change the data

If the data you have was collected as the result of running an existing policy, then an alternative would be to start randomizing the decisions made by that system in order to collect a dataset which conforms to CB. For example, use your current system to pick the "best" action 96% of the time, and one of the other 4 actions at random 4% of the time, and log the probability along with the reward (either 0.96 or 0.01 depending upon whether it was the considered best), and then set up a proper CB-style training set for vw. With this you can also counterfactually estimate the value of both your current policy and the policy vw generates, and only switch to vw when it is winning.

The fastest way to implement the last paragraph is to just start using APS.

VW contextual bandits: historical data and online learning

1 Answers1