
Last week I read a paper suggesting MDPs as an alternative approach to recommender systems. The core of that paper was the representation of the recommendation process in MDP terms, i.e. states, actions, transition probabilities, the reward function, and so on.

If we assume a single-user system for simplicity, then states look like k-tuples (x1, x2, ..., xk), where the last element xk represents the most recent item purchased by the user. For example, suppose our current state is (x1, x2, x3), which means the user purchased x1, then x2, then x3, in chronological order. Now if he purchases x4, the new state is going to be (x2, x3, x4).
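As a rough sketch (not from the paper), this state update can be viewed as a sliding window over the purchase history; the tuple size k = 3 and the item names are just placeholders:

```
# Sketch of the k-tuple state update described above (k = 3 here).
# The state holds the last k purchased items; a new purchase shifts the window.

def next_state(state, purchased_item):
    """Drop the oldest item and append the newly purchased one."""
    return state[1:] + (purchased_item,)

state = ("x1", "x2", "x3")
state = next_state(state, "x4")
print(state)  # ('x2', 'x3', 'x4')
```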

Now, what the paper suggests is that these state transitions are triggered by actions, where an action is "recommending an item x_i to the user". But the problem is that such an action may lead to more than one state.

For example, if our current state is (x1, x2, x3) and the action is "recommend x4 to the user", then the possible outcome is one of the following two (sketched as a transition distribution below the list):

the user accepts the recommendation of x4, and the new state will be (x2, x3, x4)
the user ignores the recommendation of x4 (i.e. buys something else), and the new state will be some state (x2, x3, xi) where xi != x4
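To make the two outcomes concrete, here is a hypothetical transition distribution P(s' | s, a) for this state-action pair; the alternative items x5, x6 and the probability values are made up purely for illustration:

```
# Hypothetical transition probabilities for s = (x1, x2, x3), a = "recommend x4".
# The probability values and the alternative items x5, x6 are invented.

P = {
    (("x1", "x2", "x3"), "recommend x4"): {
        ("x2", "x3", "x4"): 0.6,  # the user accepts the recommendation
        ("x2", "x3", "x5"): 0.3,  # the user buys x5 instead
        ("x2", "x3", "x6"): 0.1,  # the user buys x6 instead
    }
}

# Over all possible next states, the probabilities sum to 1.
assert abs(sum(P[(("x1", "x2", "x3"), "recommend x4")].values()) - 1.0) < 1e-9
```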

My question is: does an MDP actually support the same action triggering two or more different states?

UPDATE: I think the actions should be formulated as "gets a recommendation of item x_i and accepts it" and "gets a recommendation of item x_i and rejects it", rather than simply "gets a recommendation of item x_i".

mangusta

2 Answers


Based on this Wikipedia article, yes, it does.

I'm no expert on this, as I only just looked up the concept, but it looks as though the set of states and the set of actions have no inherent relation. Thus, multiple states can be linked to any action (or not linked) and vice versa. Therefore, an action can lead to two or more different states, and there will be a specific probability for each outcome.

Note that, in your example, you may need a set of all possible states (which seems as though it could be infinite). Further, based on what I'm reading, your states perhaps shouldn't record past history. It seems as though you could record history by keeping track of the chain itself: instead of (x1, x2, x3, xi) as a state, you'd have something more like (x1) -> (x2) -> (x3) -> (xi), i.e. four states linked by actions. (Sorry about the notation; I hope the concept makes sense.) This way, your state represents the choice of purchase (and the state space is therefore finite).
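To illustrate the single-item-state idea, here is a minimal sketch with one distribution over next states per (state, action) pair; all item names and probabilities are invented:

```
# Sketch: single-item states, one distribution over next states per (state, action).
# The items and probabilities are invented for illustration.
import random

transitions = {
    ("x3", "recommend x4"): [("x4", 0.6), ("x5", 0.3), ("x6", 0.1)],
}

def sample_next_state(state, action):
    """Sample the next state from the distribution attached to (state, action)."""
    next_states, probs = zip(*transitions[(state, action)])
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("x3", "recommend x4"))
```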

  • thanks for the reply. The paper says that the states may be k-tuples of any size, so k=1 is possible as well. I haven't read the part discussing the pros/cons of the choice of k yet, so I can't dispute it : ) What I am interested in is the possibility of the same action triggering transitions into several different states. I've read the wiki as well, but there's nothing about it – mangusta Mar 28 '16 at 02:39
  • there is also the concept of Q-learning, which defines an action-value function `Q(s,a)`. It maps every state-action pair to a reward value, hence we can choose the best action while at state `s` by comparing the `Q(s,a)` values for all actions `a` available at state `s`. But if the same action may lead to different states, it means that `Q(s,a)` will be the same for all those transitions, which makes little sense (see the sketch below) – mangusta Mar 28 '16 at 02:52
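For concreteness, here is a minimal tabular Q-learning update related to the comment above: there is indeed a single `Q(s,a)` entry per state-action pair, shared across all possible next states, and it is updated from whichever next state is actually observed. All names and numbers here are illustrative.

```
# Minimal tabular Q-learning update (illustrative; all names are made up).
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))  # Q[state][action]
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor

def q_update(s, a, reward, s_next, actions_at_s_next):
    """Update the single Q(s, a) entry from one observed transition."""
    best_next = max((Q[s_next][a2] for a2 in actions_at_s_next), default=0.0)
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])

# One observed transition: same (s, a), whichever next state was actually reached.
q_update(("x1", "x2", "x3"), "recommend x4", 1.0,
         ("x2", "x3", "x4"), ["recommend x5", "recommend x6"])
print(Q[("x1", "x2", "x3")]["recommend x4"])
```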

Sure, this is called a randomized policy. If you want to evaluate the reward of a certain policy, you have to take the expectation over the probability distribution of the randomized actions.
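A minimal sketch of that expectation at a single state, with made-up action probabilities and expected rewards:

```
# Expected one-step reward of a randomized policy at one state.
# The action probabilities and rewards are invented for illustration.

policy = {"recommend x4": 0.7, "recommend x5": 0.3}           # pi(a | s)
expected_reward = {"recommend x4": 1.0, "recommend x5": 0.2}  # E[r | s, a]

value = sum(p * expected_reward[a] for a, p in policy.items())
print(value)  # 0.7 * 1.0 + 0.3 * 0.2 = 0.76
```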

The following reference may be of interest: Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

If I remember correctly, it has been proven that there is a deterministic policy that gives the optimal reward for any MDP with a finite discrete state space and action space (and possibly some other conditions). While there may be randomized policies that achieve the same reward, we can thus restrict the search to the set of deterministic policies.
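As a sketch of what this means in practice: value iteration on a finite MDP ends by extracting a greedy policy that picks exactly one action per state, i.e. a deterministic policy. The toy transition probabilities and rewards below are invented.

```
# Value iteration on a tiny invented MDP; the extracted greedy policy is deterministic.

states = ["s0", "s1"]
actions = ["a0", "a1"]
# P[s][a] = list of (next_state, probability); R[s][a] = expected immediate reward.
P = {"s0": {"a0": [("s0", 0.5), ("s1", 0.5)], "a1": [("s1", 1.0)]},
     "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]}}
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 2.0, "a1": 0.0}}
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(200):  # iterate until (approximately) converged
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions)
         for s in states}

# Greedy policy: exactly one action per state, hence deterministic.
policy = {s: max(actions, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in P[s][a]))
          for s in states}
print(policy)
```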

Forzaa