I'm working on a multi-armed bandit problem using LinearUCBAgent and LinearThompsonSamplingAgent, but both return only a single action for an observation.
What I need is the probability for every action, which I can use for ranking.

Kushal Jain
Did you find out how to get the probabilities for these agents? Thank you – tjt Apr 03 '22 at 19:46
1 Answer
You need to add the emit_policy_info argument when defining the agent. The specific values (encapsulated in a tuple) will depend on the agent: predicted_rewards_sampled for LinearThompsonSamplingAgent and predicted_rewards_optimistic for LinearUCBAgent.
For example:
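from tf_agents.bandits.agents.linear_thompson_sampling_agent import LinearThompsonSamplingAgent
agent = LinearThompsonSamplingAgent(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    # The trailing comma makes this a one-element tuple rather than a plain string.
    emit_policy_info=("predicted_rewards_sampled",),
)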
Then, during inference, you'll need to access those fields and normalize them (via softmax):
action_step = agent.collect_policy.action(observation_step)
scores = tf.nn.softmax(action_step.info.predicted_rewards_sampled)
where tf comes from import tensorflow as tf and observation_step is your observation array encapsulated in a TimeStep (from tf_agents.trajectories.time_step import TimeStep).
Note of caution: these are NOT probabilities. They are normalized scores, similar to the normalized outputs of a fully-connected layer.
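If the goal is only ranking, note that softmax is monotonic, so sorting by the normalized scores gives the same order as sorting by the raw predicted rewards. Here is a minimal pure-Python sketch of that ranking step. It assumes the per-action predicted rewards have already been extracted from action_step.info into a plain list of floats; the helper names softmax and rank_actions and the example values are hypothetical, not part of TF-Agents.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)  # subtract the max to avoid overflow in exp
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_actions(predicted_rewards):
    """Return action indices ordered from highest to lowest normalized score."""
    scores = softmax(predicted_rewards)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical predicted rewards for four actions (made-up values).
predicted_rewards = [0.2, 1.5, -0.3, 0.9]
scores = softmax(predicted_rewards)
ranking = rank_actions(predicted_rewards)
print(ranking)
```

With the values above, action 1 ranks first and action 2 last. The scores sum to 1, but as the note says, they should be read as relative preferences rather than calibrated probabilities.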

Carlos Loza