
I'm trying to set up a deep neural network that predicts the next move for a game agent navigating a world. The agent is controlled with two float inputs. The first controls the speed (0.0 = stop/do not move, 1.0 = max speed). The second controls the steering (-1.0 = turn left, 0.0 = straight, +1.0 = turn right).

I designed the network so that it has two output neurons: one for the speed (with a sigmoid activation) and one for the steering (with a tanh activation). The actual input I want to feed the network is the pixel data plus some game state values.
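For reference, here is a minimal sketch of such a two-head model in Keras. The input sizes (84x84 grayscale pixels, an 8-value game state vector) and the layer sizes are just assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

pixels = layers.Input(shape=(84, 84, 1), name="pixels")   # assumed pixel input size
state = layers.Input(shape=(8,), name="game_state")       # assumed game-state vector size

x = layers.Conv2D(16, 8, strides=4, activation="relu")(pixels)
x = layers.Conv2D(32, 4, strides=2, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Concatenate()([x, state])
x = layers.Dense(128, activation="relu")(x)

speed = layers.Dense(1, activation="sigmoid", name="speed")(x)      # 0.0 .. 1.0
steering = layers.Dense(1, activation="tanh", name="steering")(x)   # -1.0 .. +1.0

model = Model(inputs=[pixels, state], outputs=[speed, steering])
```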

To train the network I would simply run a whole game (about 2000 frames/samples) and train the model once the game is over. This is where I struggle: what should my loss function look like? While playing I collect all actions/outputs from the network, the game state and the reward per frame/sample. When the game is done I also know whether the agent won or lost.
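The data collection part looks roughly like this. The environment interface (`env.reset()`, `env.step(speed, steering)`) is hypothetical and just stands in for whatever game loop is used; it assumes the final step also reports whether the agent won:

```python
import numpy as np

def run_episode(env, model):
    """Play one full game and record per-frame data for training afterwards."""
    frames, states, actions, rewards = [], [], [], []
    pixels, game_state = env.reset()
    done = False
    won = False
    while not done:
        speed, steering = model.predict(
            [pixels[None, ...], game_state[None, ...]], verbose=0)
        action = (float(speed[0, 0]), float(steering[0, 0]))
        (pixels, game_state), reward, done, won = env.step(*action)
        frames.append(pixels)
        states.append(game_state)
        actions.append(action)
        rewards.append(reward)
    return frames, states, actions, rewards, won
```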

Edit:

This post http://karpathy.github.io/2016/05/31/rl/ got me inspired. Maybe I could take the discounted (move, turn) value pairs, multiply them by (-1) if the game agent lost and (+1) if it won, and then use these values as gradients to update the network's weights?
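The discounting and sign flip described above could look like this (the discount factor gamma and the normalization step are assumptions, not something from the post):

```python
import numpy as np

def discounted_returns(rewards, won, gamma=0.99):
    """Turn per-frame rewards into discounted returns, signed by the game outcome."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    returns *= 1.0 if won else -1.0   # +1 if the agent won, -1 if it lost
    # normalizing keeps the gradient magnitudes stable across episodes
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns
```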

It would be nice if someone could help me out here.

All the best, Tobs.


1 Answer


The problem you are describing belongs to reinforcement learning, where an agent interacts with an environment and collects data: the game state, its actions, and the reward/score it receives. There are many approaches.

The one you are describing is a policy-gradient method. The objective is E[\sum r], where r is the score/reward, and it has to be maximized. Its gradient is A * grad(log(p_theta)), where A is the advantage function, i.e. +1/-1 for winning/losing, and p_theta is the probability of choosing the action under the policy parameterized by theta (the neural network). If the agent won, the weights are updated in favor of that policy because of the +1, and vice versa.
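Since speed and steering are continuous actions, one common way to get a log-probability is to treat the network outputs as the means of Gaussians with a fixed standard deviation (used as exploration noise when sampling actions). This is only a sketch of such a REINFORCE-style surrogate loss; the SIGMA value is an assumption:

```python
import tensorflow as tf

SIGMA = 0.1  # assumed fixed exploration noise used when sampling actions

def policy_gradient_loss(actions_taken, predicted_means, advantages):
    """Negative of E[A * log p_theta(a|s)] for a fixed-sigma Gaussian policy."""
    # log-probability (up to a constant) of the actions actually taken
    log_prob = -tf.reduce_sum(
        tf.square(actions_taken - predicted_means), axis=-1) / (2.0 * SIGMA ** 2)
    # maximizing E[A * log p] is the same as minimizing its negative
    return -tf.reduce_mean(advantages * log_prob)
```

Minimizing this loss with any standard optimizer pushes the policy toward actions that received a positive advantage and away from those with a negative one.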

Note: there are many ways to design A; in this case +1/-1 is chosen.

You can read about this in more detail here.
