
I am trying to optimize the click rate of an article or advertisement (the action) given a device type (the context) with VowpalWabbit, following this vw tutorial. However, I am not able to make it converge to the optimal actions reliably.

I created a minimal working example (sorry for the length):

import random
import numpy as np
from matplotlib import pyplot as plt
from vowpalwabbit import pyvw

plt.ion()

action_space = ["article-1", "article-2", "article-3"]

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)

def to_vw_example_format(context, cb_label=None):
    if cb_label is not None:
        chosen_action, cost, prob = cb_label
    example_string = ""
    example_string += "shared |User device={} \n".format(context)
    for action in action_space:
        if cb_label is not None and action == chosen_action:
            example_string += "1:{}:{} ".format(cost, prob)
        example_string += "|Action ad={} \n".format(action)
    # Strip the last newline
    return example_string[:-1]

# definition of problem to solve, playing out the article with highest ctr given a context
context_to_action_ctr = {
    "device-1": {"article-1": 0.05, "article-2": 0.06, "article-3": 0.04},
    "device-2": {"article-1": 0.08, "article-2": 0.07, "article-3": 0.05},
    "device-3": {"article-1": 0.01, "article-2": 0.04, "article-3": 0.09},
    "device-4": {"article-1": 0.04, "article-2": 0.04, "article-3": 0.045},
    "device-5": {"article-1": 0.09, "article-2": 0.01, "article-3": 0.07},
    "device-6": {"article-1": 0.03, "article-2": 0.09, "article-3": 0.04}
}

#vw = f"--cb_explore 3 -q UA -q UU --epsilon 0.1"
vw = f"--cb_explore_adf -q UA -q UU --bag 5 "
#vw = f"--cb_explore_adf -q UA --epsilon 0.2"

actor = pyvw.vw(vw)

random_rewards = []
actor_rewards = []
optimal_rewards = []

for step in range(200000):

    # pick a random context
    device = random.choice(list(context_to_action_ctr.keys()))

    # let vw generate probability distribution
    # action_probabilities = np.array(actor.predict(f"|x device:{device}"))
    action_probabilities = np.array(actor.predict(to_vw_example_format(device)))

    # sample action
    probabilities = action_probabilities / action_probabilities.sum()
    action_idx = np.random.choice(len(probabilities), 1, p=probabilities)[0]
    probability = action_probabilities[action_idx]

    # get reward/regret
    action_to_reward_regret = {
        action: (1, 0) if random.random() < context_to_action_ctr[device][action] else (0, 1) for action in action_space
    }

    actor_action = action_space[action_idx]
    random_action = random.choice(action_space)
    optimal_action = {
        "device-1": "article-2",
        "device-2": "article-1",
        "device-3": "article-3",
        "device-4": "article-3",
        "device-5": "article-1",
        "device-6": "article-2",
    }[device]

    # update statistics
    actor_rewards.append(action_to_reward_regret[actor_action][0])
    random_rewards.append(action_to_reward_regret[random_action][0])
    optimal_rewards.append(action_to_reward_regret[optimal_action][0])

    # learn online
    reward, regret = action_to_reward_regret[actor_action]
    cost = -1 if reward == 1 else 0
    # actor.learn(f"{action_idx+1}:{cost}:{probability} |x device:{device}")
    actor.learn(to_vw_example_format(device, (actor_action, cost, probability)))


    if step % 100 == 0 and step > 1000:
        plt.clf()
        axes = plt.gca()
        plt.title("Reward over time")
        plt.plot(running_mean(actor_rewards, 10000), label=str(vw))
        plt.plot(running_mean(random_rewards, 10000), label="Random actions")
        plt.plot(running_mean(optimal_rewards, 10000), label="Optimal actions")
        plt.legend()
        plt.pause(0.0001)

Essentially, there are three possible actions (articles 1-3) and six contexts (devices 1-6). Each combination has a certain CTR (click-through rate), so every context has an optimal action (the article with the highest CTR for that device). In each iteration a random context is sampled and the reward/regret for every action is computed. The cost passed to VowpalWabbit for learning is -1 if the reward is 1 (the user clicked) and 0 if the reward is 0 (the user did not click). Over time, the algorithm is supposed to find the best article for each device.
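For illustration, this is the ADF example string that `to_vw_example_format` produces for a `learn` call, assuming the sampled context is `device-3`, the chosen action is `article-3`, the observed cost is -1 and the sampling probability was 0.33 (the concrete numbers are just an example draw):

```
shared |User device=device-3
|Action ad=article-1
|Action ad=article-2
1:-1:0.33 |Action ad=article-3
```

The cost label is only attached to the chosen action's line; the `predict` call uses the same format without any label.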

Some results (average reward over time): [plot: "Not Converging"], [plot: "Better Result"]

The problems:

  • Starting the same program multiple times leads to differing results (sometimes it does not converge at all, sometimes it finds the optimal actions and sticks with them).
  • After a bad start (early convergence to sub-optimal arms), the algorithm does not move to better arms even when given more time.
  • The algorithm seems to get trapped early in a local minimum, which then determines the performance for the rest of the experiment (even with some exploration factor).

Since the CTRs are fairly small and many playouts are needed for convergence, I understand that this is a hard problem. I would, however, expect the algorithm to find the optimum given enough time.
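For a rough sense of scale, a standard two-proportion sample-size calculation (a sketch; the exact numbers depend on the significance and power chosen) suggests thousands of impressions per arm are needed just to tell the closest CTRs apart:

```
from math import sqrt

def samples_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate samples per arm to distinguish two Bernoulli rates
    at ~5% significance and ~80% power (two-proportion formula)."""
    p_bar = (p1 + p2) / 2
    return ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
            / abs(p1 - p2)) ** 2

print(round(samples_per_arm(0.05, 0.06)))    # device-1: roughly 8,000 per arm
print(round(samples_per_arm(0.04, 0.045)))   # device-4: roughly 25,000 per arm
```

With six contexts sharing 200,000 steps and only one arm observed per step, the closest pairs (e.g. device-4) are genuinely hard to separate.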

Did I miss something in the configuration of VowpalWabbit?

  • This seems to work better: ```vw = f"--cb_explore 3 -q xx --bag 5"``` – Alexander Bartl Jun 30 '21 at 12:25
  • Are you plotting only exploit or explore and exploit? – jackgerrits Jul 06 '21 at 12:37
  • @jackgerrits it's both in one. Rewards are clicks, so the graphs are basically averaged clicks over time. – Alexander Bartl Jul 07 '21 at 14:40
  • I'm curious if this behavior is a function of using online bagging as your exploration method. Do you experience similar behavior with epsilon greedy or online cover (nonuniform)? I see you have epsilon greedy commented out in your code snippet above. Another thing you could try is setting a static learning rate to avoid early overconvergence with something like this: `--power_t 0 -l 0.05`. – R. Angi Jul 07 '21 at 17:15
  • Hi @R.Angi. I also tried with your suggestions: ```--cb_explore_adf -q UA -q UU --bag 5 --power_t 0 -l 0.05``` and it converges much better, however sometimes it gets stuck in a relatively good but not optimal policy. Epsilon greedy does not get stuck in local minima but also does not really converge. ```--cb_explore_adf -q UA -q UU --cover 1``` also seems to work (what do you mean by uniform?), however ```--bag 5 --power_t 0 -l 0.05``` seems to produce more consistent results. Once I increase the number of arms, though, it gets stuck in local optima much more often. – Alexander Bartl Jul 15 '21 at 10:25
  • You can add a flag `--nounif` when using `--cover X`, which, from my understanding, drops the uniform random exploration (almost like an epsilon-greedy term) that is otherwise added on top of sampling from the covering policies in the original online cover algorithm. This shows some promising results in the Bandit Bake-off paper here: https://arxiv.org/pdf/1802.04064.pdf (I think because it's a little more greedy?). All this being said, nonuniform probably won't make much of a difference. I would experiment with a lower learning rate and see if that helps? Try using vw-hypersearch (see the wiki). – R. Angi Jul 16 '21 at 14:57

1 Answer


Just "--bag 5" without epsilon parameter can start to produce zero probabilities if all policies agree with each other.

I modified your code a bit to track the probabilities returned by actor.predict.
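Roughly, the tracking looks like this (a sketch rather than the exact code; it plugs into the question's training loop and reuses `context_to_action_ctr`, `action_space`, `probabilities`, `optimal_action`, `device`, `running_mean` and `plt` from there):

```
# Before the loop: one probability trace per context.
optimal_action_probs = {device: [] for device in context_to_action_ctr}

# Inside the loop, after `probabilities` and `optimal_action` are computed:
# record how much probability mass the exploration PMF puts on the optimal
# action for the current context. A trace stuck at 0 means exploration died.
opt_idx = action_space.index(optimal_action)
optimal_action_probs[device].append(probabilities[opt_idx])

# After training: moving average of that probability per context.
for device, probs in optimal_action_probs.items():
    if len(probs) > 10000:
        plt.plot(running_mean(probs, 10000), label=device)
plt.legend()
plt.title("P(optimal action) over time")
plt.show()
```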

First plot: moving average of the predicted probabilities for each context's "optimal action". We can see that it actually stays around 0 for a few of them, and the tail of the decision table shows that those distributions have collapsed to something like [1, 0, 0]. So there is no chance to recover from this point, since exploration is essentially turned off.

[plot: probability of the optimal action per context with --bag 5 only]

Adding a small amount of exploration (--bag 5 --epsilon 0.02) helps it eventually converge to the global minimum and gives plots like this:

[plot: probability of the optimal action per context with --bag 5 --epsilon 0.02]

Learning is not fast, but the problematic contexts are the most ambiguous ones, and they do not contribute much to regret.
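For reference, in the question's script this only requires changing the config line, e.g.:

```
vw = "--cb_explore_adf -q UA -q UU --bag 5 --epsilon 0.02"
actor = pyvw.vw(vw)
```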