
I know the basics of feedforward neural networks and how to train them using the backpropagation algorithm, but I'm looking for an algorithm that I can use for training an ANN online with reinforcement learning.

For example, the cart-pole swing-up problem is one I'd like to solve with an ANN. In that case, I don't know what should be done to control the pendulum; I only know how close I am to the ideal position. I need the ANN to learn based on reward and punishment. Thus, supervised learning isn't an option.

Another situation is something like the snake game, where feedback is delayed and limited to goals and anti-goals, rather than a reward.

I can think of some algorithms for the first situation, like hill-climbing or genetic algorithms, but I'm guessing they would both be slow. They might also be applicable in the second scenario, but incredibly slow, and not conducive to online learning.

My question is simple: Is there a simple algorithm for training an artificial neural network with reinforcement learning? I'm mainly interested in real-time reward situations, but if an algorithm for goal-based situations is available, even better.

Kendall Frey
  • Good question, and I'm thinking almost exactly the same thing, where in my case the neural network is recurrent. One key point is that you're talking about 2 different learning algorithms. You cannot apply 2 different learning algorithms to the same problem without causing conflicts, unless you have a way to resolve them. – Yan King Yin Aug 15 '15 at 10:07

2 Answers


There are some research papers on the topic:

And some code:

Those are just some of the top Google search results on the topic. The first couple of papers look like they're pretty good, although I haven't read them personally. I think you'll find even more information on neural networks with reinforcement learning if you do a quick search on Google Scholar.

Rémi
Kiril
  • The third link mentioned something about Q-learning. Is that applicable to the cart-pole problem? – Kendall Frey May 23 '12 at 16:40
  • It seems to be applicable since it allows you to compare the expected utility of the available actions without having a model of the environment. So if you're doing an actual cart-pole problem with hardware, then it's going to be helpful. For more details on Q-learning see: http://www.applied-mathematics.net/qlearning/qlearning.html – Kiril May 23 '12 at 16:47
  • Doesn't Q-learning involve a finite set of actions? The ideal cart-pole problem will have a continuous set of actions. Is that a problem? – Kendall Frey May 23 '12 at 16:50
  • At the end of the day, there are many variants of NN and approaches for training which can be used to solve a problem, so you really have to spend a lot of time evaluating what's best for your problem domain. I would estimate that to be around 80% of the work! – Kiril May 23 '12 at 17:00
  • I'm not all that familiar with the cart-pole problem, but I saw a [discussion on google groups](https://groups.google.com/forum/#!topic/rl-list/qrVT5FSQonk) and it indicated that discrete actions lead to better results. It seems that there are RL + Continuous Action Space solutions involving Sequential Monte Carlo methods (referenced in the same discussion), but that seems like a slightly different approach. – Kiril May 23 '12 at 18:34
  • The code examples for neural network reinforcement learning (particularly the M-files by Chuck Anderson) do not actually describe "reinforcement" learning. The example given uses samples and a target to train the neural network over a set of cycles. Reinforcement learning entails a reward, not a target... Also, the variable names are all shorthand, and everything is done using for loops rather than matrix operations. Is there another example? – Chris Dec 31 '15 at 14:31
  • The code example given is for *recurrent* neural networks. The OP specified *feedforward* neural nets. It would be nice if a link were given to code that used feedforward nets. (I don't understand/haven't programmed recurrent nets yet, so the code is not nearly as helpful as it could be.) – Pro Q Apr 15 '17 at 16:19
  • @ProQ The link includes a feedforward neural network "train.c is a C program for training multilayer, **feedforward neural networks** with error backpropagation using early stopping and cross-validation." – Kiril Apr 15 '17 at 16:29
  • @Lirik ah, sorry, I had scrolled too far. Thank you! – Pro Q Apr 15 '17 at 17:11
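
The comments above touch on Q-learning with a discrete set of actions. For reference, here is a minimal, self-contained sketch of the tabular Q-learning update; the toy chain-world environment, its rewards, and the hyperparameters are all invented for illustration (this is not the cart-pole problem). For cart-pole you would either discretize the continuous state and actions, or let the feedforward ANN approximate Q(s, a) in place of the table.

```python
# Toy Q-learning sketch on a hypothetical 6-state chain world (not cart-pole).
# Every state, reward, and hyperparameter below is invented for illustration.
import random

N_STATES = 6                  # states 0..5; reaching state 5 ends the episode
ACTIONS = (-1, +1)            # move left or right
GOAL = N_STATES - 1
alpha, gamma, epsilon = 0.1, 0.9, 0.1

# Q table; for cart-pole this table could be replaced by a feedforward ANN
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def greedy(state):
    """Best action under the current Q values, ties broken at random."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    s = 0
    for t in range(200):                     # cap episode length for safety
        # epsilon-greedy exploration
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s2, r, done = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r if done else r + gamma * max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break

# After training, the greedy policy should move right in every non-goal state.
print({s: greedy(s) for s in range(N_STATES)})
```

The same epsilon-greedy loop carries over when an ANN replaces the table: the update target `r + gamma * max_a' Q(s', a')` simply becomes the supervised training target for the network's output corresponding to the chosen action.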

If the output that led to a reward r is backpropagated into the network r times, you will reinforce the network proportionally to the reward. This is not directly applicable to negative rewards, but I can think of two solutions that will produce different effects:

1) If you have a set of rewards in a range rmin-rmax, rescale them to 0-(rmax-rmin) so that they are all non-negative. The bigger the reward, the stronger the reinforcement that is created.

2) For a negative reward -r, backpropagate a random output r times, as long as it's different from the one that led to the negative reward. This will not only reinforce desirable outputs, but also diffuse or avoid bad outputs.
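
Below is a minimal sketch of option 1, assuming a tiny two-layer feedforward net with sigmoid units trained by plain backpropagation (NumPy). The layer sizes, learning rate, reward range, and the state/action/reward in the usage lines are made-up illustration values, not values from the answer.

```python
# Sketch of option 1: the chosen output becomes a one-hot supervised target and is
# backpropagated round(reward - r_min) times, so the reinforcement is proportional
# to the rescaled, non-negative reward. All sizes and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2            # e.g. 4 state variables, 2 actions
W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
lr = 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    h = sigmoid(x @ W1)                    # hidden activations
    y = sigmoid(h @ W2)                    # output activations
    return h, y

def backprop_once(x, target):
    """One ordinary supervised backprop step toward `target` (MSE loss)."""
    global W1, W2
    h, y = forward(x)
    dy = (y - target) * y * (1 - y)        # output-layer delta
    dh = (dy @ W2.T) * h * (1 - h)         # hidden-layer delta
    W2 -= lr * np.outer(h, dy)
    W1 -= lr * np.outer(x, dh)

def reinforce(x, chosen_action, reward, r_min=0.0):
    """Rescale the reward to be non-negative, then backprop the chosen output
    as a one-hot target that many times (rounded to the nearest integer)."""
    scaled = reward - r_min                # maps [r_min, r_max] to [0, r_max - r_min]
    target = np.zeros(n_out)
    target[chosen_action] = 1.0
    for _ in range(int(round(scaled))):
        backprop_once(x, target)

# Hypothetical usage: observe a state, act on the strongest output, get a reward.
state = rng.random(n_in)
action = int(np.argmax(forward(state)[1]))
reward_from_environment = 3.0              # made-up reward for illustration
reinforce(state, action, reward_from_environment, r_min=0.0)
```

Option 2 would be analogous: for a negative reward, build the one-hot target from a randomly chosen action index different from `chosen_action` and backpropagate that instead.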

Junuxx
  • Interesting. I wonder how this applies to delayed reward. I'm guessing it would work to specify anything that isn't a goal as a small negative reward. – Kendall Frey May 23 '12 at 16:35
  • @Kendall Frey: For a neural network that can handle delays better than normal neural networks, see [Long short term memory](https://en.wikipedia.org/wiki/Long_short_term_memory) ([Youtube video](http://www.youtube.com/watch?v=izGl1YSH_JA)), or see [hierarchical temporal memory](https://en.wikipedia.org/wiki/Hierarchical_temporal_memory) ([Youtube video](http://www.youtube.com/watch?v=48r-IeYOvG4)). – HelloGoodbye Feb 15 '14 at 18:58
  • Why should you rescale the rewards like that? Do they need rescaling at all? By doing that, a reward `rmin` becomes `0` after rescaling, so what was supposed to be a reward will turn out to have no reinforcement effect on the network. – HelloGoodbye Feb 15 '14 at 21:10
  • Brilliant idea!! It is so simple, but in my case also very effective! +1 – Thomas Sparber Dec 30 '16 at 16:03