Let's say I have n states S = {s1, s2, s3, ..., sn} and a score for every transition, i.e. a T-matrix, e.g. s1->s5 = 0.3, s4->s3 = 0.7, etc.
What algorithm or procedure should I use to select the best-scoring sequence/path starting from state x (s_x)?
Two questions:
- How do I pick the best next state, so that over an infinitely long path I get the best possible score on average?
- Given a path length L, how do I pick the sequence of states that generates the highest total score?
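To make the second question concrete, here is a minimal sketch of one way to solve it: backward dynamic programming over the number of remaining steps. It assumes the score of a path is the sum of its transition scores, that T is an n x n NumPy array with T[i, j] the score of si->sj, and that a path of length L means L transitions. The names `best_path`, `start` and `choice` are only for illustration.

```python
import numpy as np

def best_path(T, start, L):
    """Return (total_score, path) for the best path of L transitions from `start`.

    Assumption: T[i, j] is the score of moving from state i to state j,
    and the score of a path is the sum of its transition scores.
    """
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    V = np.zeros(n)                       # best score with 0 steps remaining
    choice = np.zeros((L, n), dtype=int)  # choice[k, i]: best next state at step k
    # Work backwards from the last step to the first.
    for k in range(L - 1, -1, -1):
        totals = T + V[None, :]           # totals[i, j] = T[i, j] + V[j]
        choice[k] = totals.argmax(axis=1)
        V = totals.max(axis=1)
    # Read the path forwards from the stored choices.
    path = [start]
    for k in range(L):
        path.append(int(choice[k, path[-1]]))
    return float(V[start]), path

# Tiny example where the greedy first step (0 -> 1) is not the best choice.
T = np.array([[0, 2, 1],
              [0, 0, 0],
              [0, 0, 10]])
score, path = best_path(T, start=0, L=2)
print(score, path)
```

The cost is on the order of L * n^2 operations, so recomputing it whenever T changes is feasible for moderate n.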
I'm currently researching reinforcement learning, but it seems like overkill, because I have neither actions nor policies. Maybe I can use something like a value function, I don't know.
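To illustrate what I mean by a value function: if "choose the next state" is treated as the action, the problem looks like a deterministic decision process, and value iteration applies directly. The sketch below is only that, under assumptions that are not part of my setup: a discount factor `gamma` (below 1) stands in for the "infinitely long path" criterion, and T is a fixed n x n NumPy array of finite scores. A true long-run average is a different criterion, so this is an approximation. The name `value_iteration` is just for illustration.

```python
import numpy as np

def value_iteration(T, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Return (V, best_next) for the discounted problem.

    V[i] approximates max over paths from i of sum_t gamma**t * score_t.
    best_next[i] is the next state to pick when in state i.
    Assumption: gamma < 1 is a modelling choice, not given in the question.
    """
    T = np.asarray(T, dtype=float)
    V = np.zeros(T.shape[0])
    for _ in range(max_iter):
        # Q[i, j] = immediate score of i -> j plus discounted value of j.
        Q = T + gamma * V[None, :]
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = T + gamma * V[None, :]
    return V, Q.argmax(axis=1)

T = np.array([[0.0, 1.0],
              [2.0, 0.0]])
V, best_next = value_iteration(T, gamma=0.5)
print(V, best_next)
```

A gamma closer to 1 weighs the long run more heavily, at the price of slower convergence.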
What would you use?
PS: In some scenarios the T-matrix may change over time.
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
It seems that Q-learning is a good bet. The only difference I see is that if I am to store Q-values over time, I have to figure out a way to accommodate the changing T-matrix.
The second, harder difference is that there is no final goal, only changing intermediate scores. Maybe I don't need to change the algorithm: it will simply converge towards the changing scores, which is OK, I think.
My initial thought was to compute the best L-step path at every time step (i.e. recalculate Q from scratch every time), but if I can, I would prefer to keep a Q table that is updated as new data comes in.
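What I have in mind for the running Q table is roughly the standard Q-learning update with a constant learning rate, so that old observations fade and the table can follow a drifting T-matrix. This is a rough sketch, not a tested design: `alpha`, `gamma` and the function name `q_update` are my assumptions, and the "action" is simply the choice of next state.

```python
import numpy as np

def q_update(Q, s, s_next, score, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place for an observed transition.

    Q[s, s_next] estimates the discounted score of moving s -> s_next
    and choosing the best-known next state from then on.
    Assumption: constant alpha, so recent scores outweigh old ones.
    """
    target = score + gamma * Q[s_next].max()
    Q[s, s_next] += alpha * (target - Q[s, s_next])
    return Q

# Hypothetical usage: feed transitions as they are observed.
n = 3
Q = np.zeros((n, n))
q_update(Q, s=0, s_next=2, score=0.3)
q_update(Q, s=2, s_next=1, score=0.7)
best_next_from_0 = int(Q[0].argmax())
print(Q, best_next_from_0)
```

With a constant `alpha`, each estimate is an exponentially weighted average of past targets, so when a score changes the old value is gradually forgotten. A larger `alpha` adapts faster but is noisier.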