2

I want to implement Q-learning for the BipedalWalker-v2 environment from OpenAI Gym, but every tutorial I've found covers a finite environment, which makes the Q matrix and reward matrix simple to initialize.

e.g: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

My only question is: what should the dimensions of those matrices be in a more open-ended environment such as the one I want to use?

Environment in question: https://gym.openai.com/envs/BipedalWalker-v2/

The observations you get (note that some values can be infinite): https://github.com/openai/gym/wiki/BipedalWalker-v2

Tissuebox
  • 1,016
  • 3
  • 14
  • 36

2 Answers

3

Reinforcement Learning methods that store Q values in a matrix (or table) are referred to as tabular RL methods. These are the most straightforward/simple approaches, but as you have discovered, not always easily applicable.

One solution you can try is to discretize your state space by creating lots of "bins". For example, the hull_angle observation can range from 0 to 2*pi. You could map any state in which 0 < hull_angle <= 0.1 to the first bin, states with 0.1 < hull_angle <= 0.2 to the second bin, etc. If there is an observation that can range from -inf to +inf, you can simply decide to put a threshold somewhere and treat every value beyond that threshold as belonging to the same bin (e.g. everything from -inf to -10 maps to one bin, everything from 10 to +inf to another bin, and then smaller intervals for more bins in between).
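As a rough sketch of that binning idea (the bin count, clip range, and the `discretize` helper are arbitrary choices made for illustration, not anything prescribed by the environment), a single observation can be mapped to a bin index with `numpy.digitize` after clipping unbounded values:

```python
import numpy as np

N_BINS = 10
# Interior bin edges for a hypothetical observation ranging over [0, 2*pi),
# e.g. hull_angle; 9 edges split the range into 10 bins.
bin_edges = np.linspace(0.0, 2 * np.pi, N_BINS + 1)[1:-1]

def discretize(value, edges, low, high):
    """Clip a (possibly unbounded) observation, then map it to a bin index."""
    clipped = np.clip(value, low, high)
    return int(np.digitize(clipped, edges))

print(discretize(0.05, bin_edges, 0.0, 2 * np.pi))  # first bin (0)
print(discretize(1e9,  bin_edges, 0.0, 2 * np.pi))  # clipped, so last bin (9)
```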

You'd have to discretize every single one of the observations into such bins though (or simply throw some observations away), and the combination of all bin indices together would then form a single index into your matrix. BipedalWalker-v2 gives you 24 different observations; with, for example, 10 bins per observation, your final matrix of Q values would have 10^24 entries, which is a... rather big number that certainly doesn't fit in your memory.
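To make the "combination of all bin indices forms a single index" point concrete, here is a small sketch (the 10-bins-per-observation choice is just an example, as above); it treats the per-observation bin indices as digits of a base-10 number, which also shows why the resulting table is hopeless:

```python
N_OBS = 24    # observation vector length for BipedalWalker-v2
N_BINS = 10   # bins per observation (an arbitrary example choice)

def flat_index(bin_indices, n_bins=N_BINS):
    """Treat the per-observation bin indices as digits of a base-n_bins number."""
    idx = 0
    for b in bin_indices:
        idx = idx * n_bins + b
    return idx

# A tabular Q matrix over these discretized states would need this many rows:
print(N_BINS ** N_OBS)  # 10**24 entries -- far too many to store
```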


A different solution is to look into RL methods with Function Approximation. The simplest class of function approximation methods uses Linear Function Approximation, and those are the methods I'd recommend looking into first for your problem. Linear function approximation methods essentially try to learn a linear function (a vector of weights) such that your Q-values are estimated by taking the dot product between the vector of weights and your vector of observations / features.
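As a minimal sketch of what that looks like in code (with one assumption worth flagging: BipedalWalker's actions are actually continuous, so this sketch pretends they have been discretized into a small set; `N_ACTIONS`, the step sizes, and the helper names are all illustrative choices):

```python
import numpy as np

N_FEATURES = 24   # length of the observation / feature vector
N_ACTIONS = 16    # hypothetical discretization of the continuous actions

# One weight vector per action; Q(s, a) = weights[a] . features(s)
weights = np.zeros((N_ACTIONS, N_FEATURES))

def q_value(obs, action):
    return float(weights[action] @ obs)

def q_learning_update(obs, action, reward, next_obs, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning step with linear function approximation."""
    target = reward + gamma * max(q_value(next_obs, a) for a in range(N_ACTIONS))
    td_error = target - q_value(obs, action)
    # The gradient of a linear Q-function w.r.t. its weights is just the feature vector.
    weights[action] += alpha * td_error * obs
```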

If you're familiar with the draft of the second edition for Sutton and Barto's Reinforcement Learning book, you'll find many such methods throughout chapters 9-12.


Another class of function approximation methods uses (deep) Neural Networks as function approximators, instead of linear functions. These may work better than linear function approximation, but are also much more complicated to understand and often require a long time to run. If you want to get the best results, they may be good to take a look at, but if you're still learning and have never seen any non-tabular RL methods yet, it's probably wise to look into simpler variants such as Linear Function Approximation first.

Dennis Soemers
  • 8,090
  • 2
  • 32
  • 55
  • I looked up a bit more and yeah, function approximation is the way to go. From what I saw, you pretty much just do gradient descent with the reward as the output label and 3 or 4 frames of input at the same time, training for only one epoch each time. My question now is: how can you predict more than the very next state with this? Do we use another neural net to approximate the next state and then use the reward approximator on that generated state for the Bellman equation? – Tissuebox Jun 22 '18 at 20:40
  • @Tissuebox No, that would be yet another class of RL methods named "model-based RL". The standard approach, even with function approximation, is still to try to learn a `Q`-function that really predicts `Q`-values (e.g. the training signal is not just the one-step reward, but for example the one-step reward plus `gamma` times the `Q`-value predicted for the next state by the function learned so far). It's exactly the same kind of update rule as you have in tabular RL. You can view the `Q`-table in tabular RL as a "function approximator" too, one which perfectly distinguishes all states. – Dennis Soemers Jun 23 '18 at 08:33
1

In the case of a continuous state space, it is prudent to look at neural network approximation instead of binning the data, especially in your case, where there are multiple state features. Binning the data would still suffer from the curse of dimensionality. If you want to use Q-learning, take a look at Deep Q-Networks (DQN), a very popular deep RL method popularized by Google DeepMind. If you are wondering how to start on the problem, look at simple GitHub examples that use Keras, which is a very simple neural network library.
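As a hedged sketch of the kind of Q-network such examples build (again assuming the continuous actions have been discretized into `N_ACTIONS` choices, which a plain DQN requires; the layer sizes here are arbitrary):

```python
import numpy as np
from tensorflow import keras

N_OBS = 24        # BipedalWalker-v2 observation size
N_ACTIONS = 16    # hypothetical discretization of the continuous actions

# A small Q-network: observation in, one Q-value per (discretized) action out.
model = keras.Sequential([
    keras.layers.Input(shape=(N_OBS,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(N_ACTIONS, activation="linear"),
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

# DQN fits each sampled transition toward r + gamma * max_a' Q(s', a').
obs = np.random.randn(1, N_OBS).astype(np.float32)
q_values = model.predict(obs, verbose=0)  # shape (1, N_ACTIONS)
```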

shunyo
  • 1,277
  • 15
  • 32