I am new to Q-learning, and I recently tried to apply the algorithm to a problem with 9 states and 2 possible actions. I have a large number of time series, each with only 10 data points, and I want to choose between the two actions at time t=10. The problem is that, with so few points per series, the Q matrix has not yet been updated for most states, so the decision is effectively random.
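To make the setup concrete, here is a minimal sketch of the kind of tabular update I mean. The learning rate, the discount factor, and the example series are made-up values for illustration, not my real data. The visit counter is only there to show how many states a single 10-point series leaves untouched.

```python
import numpy as np

N_STATES, N_ACTIONS = 9, 2
ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor


def q_learning_on_series(states, actions, rewards):
    """Apply tabular Q-learning updates along one short series.

    `states` has one more entry than `actions` and `rewards`,
    because each transition goes from states[t] to states[t + 1].
    """
    q = np.zeros((N_STATES, N_ACTIONS))
    visits = np.zeros((N_STATES, N_ACTIONS), dtype=int)
    for t in range(len(actions)):
        s, a, r, s_next = states[t], actions[t], rewards[t], states[t + 1]
        # Standard Q-learning target: reward plus discounted best next value
        td_target = r + GAMMA * q[s_next].max()
        q[s, a] += ALPHA * (td_target - q[s, a])
        visits[s, a] += 1
    return q, visits


# Hypothetical series with 10 data points, i.e. 9 transitions
states = [0, 1, 1, 2, 3, 3, 4, 3, 2, 1]
actions = [0, 1, 0, 0, 1, 0, 1, 1, 0]
rewards = [0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 1.0]

q, visits = q_learning_on_series(states, actions, rewards)

# Number of states whose row of Q was never updated by this series
unvisited_states = int((visits.sum(axis=1) == 0).sum())
print(unvisited_states, "of", N_STATES, "states were never updated")
```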
I was considering clustering the time series and computing an averaged Q matrix for each cluster. For each series, I would then choose an action from its cluster's averaged Q, based on the state that series is in.
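A minimal sketch of the averaging idea, assuming the per-series Q matrices and the cluster labels are already available. The function names, the random Q values, and the labels below are hypothetical placeholders, not results.

```python
import numpy as np


def mean_q_per_cluster(q_tables, labels):
    """Average the Q matrices of all series that share a cluster label."""
    q_tables = np.asarray(q_tables, dtype=float)  # shape (n_series, 9, 2)
    labels = np.asarray(labels)
    return {
        int(c): q_tables[labels == c].mean(axis=0)
        for c in np.unique(labels)
    }


def greedy_action(q_mean, state):
    """Pick the action with the highest averaged Q value in `state`."""
    return int(np.argmax(q_mean[state]))


# Hypothetical input: 4 series assigned to 2 clusters
rng = np.random.default_rng(0)
q_tables = rng.normal(size=(4, 9, 2))  # placeholder Q values
labels = [0, 0, 1, 1]                  # placeholder cluster assignment

cluster_q = mean_q_per_cluster(q_tables, labels)

# Decision for the first series, assuming it is in state 3 at t=10
action = greedy_action(cluster_q[labels[0]], state=3)
```

With zero-initialized tables, an entry that was never updated in any series of a cluster stays at zero in the mean, and `np.argmax` returns action 0 when both actions tie.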
My question is whether taking the mean of multiple Q matrices makes sense, or whether another approach would be more suitable in this case.
Thanks for your help!