
I am new to Q-learning, and I recently tried to apply it to a problem with 9 states and 2 possible actions. I have a large number of time series, each with only 10 data points, and I want to choose between the two actions at time t=10. The problem is that, with so few observations per series, the Q matrix has not yet been updated for most states by then, so the decision is effectively random.

I was considering clustering the time series and computing an averaged Q matrix for each cluster. I would then choose the action for each series from its cluster's averaged Q matrix, based on the state of that particular series.

The question is whether taking the mean of multiple Q matrices makes sense, or whether another approach would be more suitable in this case.
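To make the averaging idea concrete, here is a minimal sketch of one way it could be done. It assumes each series keeps its own 9x2 Q table plus a table of visit counts (how many times each entry was updated), and it weights each table by those counts so that untouched entries do not dilute the mean. The names `pooled_q` and `empty_table`, the zero initialisation, and the example values are all hypothetical, not part of my actual setup.

```python
N_STATES, N_ACTIONS = 9, 2

def empty_table():
    """Return a 9x2 table of zeros (assumed initial Q values and counts)."""
    return [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def pooled_q(q_tables, visit_counts):
    """Average Q tables entry by entry, weighting each by its visit count.

    Entries that no series in the cluster ever updated stay at 0.0.
    """
    pooled = empty_table()
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            total = sum(n[s][a] for n in visit_counts)
            if total > 0:
                weighted = sum(q[s][a] * n[s][a]
                               for q, n in zip(q_tables, visit_counts))
                pooled[s][a] = weighted / total
    return pooled

# Two hypothetical series from the same cluster.
q1, n1 = empty_table(), empty_table()
q2, n2 = empty_table(), empty_table()
q1[4][0], n1[4][0] = -0.10, 3   # series 1 updated (state 4, action 0) three times
q2[4][0], n2[4][0] = -0.30, 1   # series 2 updated the same entry once
q2[4][1], n2[4][1] = -0.05, 2   # only series 2 tried action 1 in state 4

q_mean = pooled_q([q1, q2], [n1, n2])
best_action = max(range(N_ACTIONS), key=lambda a: q_mean[4][a])
```

One caveat with this sketch: because rewards here are never positive, an entry left at 0.0 looks better than any entry that was actually visited, so unvisited actions would need separate handling before taking the argmax.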

Thanks for your help!

som
  • Could you provide a little more information? What is the reward structure? In addition, the relation between the time series and the states is not clear to me. Could you clarify that too? – Juan Leni Aug 03 '17 at 12:13
  • For each product, I have 4 time series A, B, C and D. Their (simplified) meaning is the following: A is a simple forecast of time series C; B is an alternative forecast of C; C is the actual value of the series; D is a more expensive but (theoretically) more accurate forecast of C. – som Aug 07 '17 at 18:18
  • The state is defined by comparing A and B with C. For instance, if A is more than 10% above the value of C and B is more than 10% below the value of C, the state is "+-". If A is close to C (within 10%) and B is more than 10% above the value of C, the state is "0+". So there are 9 states in total, one for each possible combination. – som Aug 07 '17 at 18:18
  • There are two possible actions: choosing series A or choosing series D in order to forecast C. If A is chosen, the reward is -abs((A-C)/C) and if D is chosen, the reward is -abs((D-C)/C). – som Aug 07 '17 at 18:19
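The state and reward definitions in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: C is taken to be positive, a deviation of exactly 10% is treated as "close" (label "0"), and the function names `sign_label`, `state` and `reward` are hypothetical.

```python
def sign_label(forecast, actual, tol=0.10):
    """Label a forecast as '+', '-' or '0' relative to the actual value.

    Assumes actual > 0; a deviation of exactly tol counts as '0'.
    """
    rel = (forecast - actual) / actual
    if rel > tol:
        return "+"
    if rel < -tol:
        return "-"
    return "0"

def state(a, b, c, tol=0.10):
    """Two-character state: label of A versus C, then label of B versus C."""
    return sign_label(a, c, tol) + sign_label(b, c, tol)

def reward(chosen_forecast, c):
    """Negative absolute relative error of the chosen forecast (A or D)."""
    return -abs((chosen_forecast - c) / c)

# Hypothetical values for one product at one time step.
s1 = state(115, 85, 100)    # A is 15% above C, B is 15% below C
s2 = state(102, 120, 100)   # A is within 10% of C, B is 20% above C
r = reward(90, 100)         # chosen forecast is 10% below C
```

With three labels for A and three for B, this encoding yields the 9 states mentioned above.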
