I'm trying to set up a deep neural network that predicts the next move for a game agent navigating a world. The game agent is controlled by two float inputs. The first one controls the speed (0.0 = stop/do not move, 1.0 = max speed). The second controls the steering (-1.0 = turn left, 0.0 = go straight, +1.0 = turn right).
I designed the network so that it has two output neurons: one for the speed (with a sigmoid activation) and one for the steering (with a tanh activation). The input I want to feed the network is the pixel data plus some game state values.
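Roughly this is the network I have in mind (a minimal sketch, assuming PyTorch; the input and hidden sizes are just placeholders):

```python
import torch
import torch.nn as nn

class AgentNet(nn.Module):
    def __init__(self, input_size=1024, hidden_size=256):
        super().__init__()
        # shared body that processes the flattened pixels + game state values
        self.body = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
        )
        self.speed_head = nn.Linear(hidden_size, 1)  # -> sigmoid, 0.0 .. 1.0
        self.steer_head = nn.Linear(hidden_size, 1)  # -> tanh, -1.0 .. +1.0

    def forward(self, x):
        h = self.body(x)
        speed = torch.sigmoid(self.speed_head(h))
        steering = torch.tanh(self.steer_head(h))
        return speed, steering

net = AgentNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
```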
To train the network, I would simply run a whole game (about 2000 frames/samples) and train the model once the game is over. This is where I struggle: what would my loss function look like? While playing, I collect all actions/outputs from the network, the game state, and a reward per frame/sample. When the game is done I also know whether the agent won or lost.
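The collection step looks roughly like this (sketch reusing `net` from above; the `env` object and its methods `get_observation`, `apply_action`, `get_reward`, `is_over`, `won` are just stand-ins for my game loop, and the Gaussian exploration noise is an assumption on my part):

```python
import torch

observations, actions, rewards = [], [], []

obs = env.get_observation()          # flattened pixels + game state values
while not env.is_over():
    with torch.no_grad():
        speed_mean, steer_mean = net(torch.as_tensor(obs, dtype=torch.float32))
    # add a little Gaussian noise so the agent actually explores variations
    speed = float(torch.clamp(speed_mean + 0.1 * torch.randn(1), 0.0, 1.0))
    steering = float(torch.clamp(steer_mean + 0.1 * torch.randn(1), -1.0, 1.0))

    env.apply_action(speed, steering)

    observations.append(obs)
    actions.append((speed, steering))
    rewards.append(env.get_reward())  # per-frame reward
    obs = env.get_observation()

episode_won = env.won()               # True if the agent won, False if it lost
```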
Edit:
This post http://karpathy.github.io/2016/05/31/rl/ inspired me. Maybe I could use the discounted (move, turn) value pairs, multiply them by -1 if the game agent lost and by +1 if it won, and then use these values as gradients to update the network's weights?
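Something like this is the update I'm imagining (continuing from the sketches above; `gamma`, the fixed exploration std of 0.1, and treating the two outputs as the mean of a Gaussian policy are my own assumptions, not something the Karpathy post prescribes for this exact setup):

```python
import numpy as np
import torch

def discounted_returns(rewards, gamma=0.99):
    # standard discounted sum of future rewards for every frame
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = torch.as_tensor(discounted_returns(rewards))
returns = returns * (1.0 if episode_won else -1.0)             # flip sign if the agent lost
returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # optional normalisation

obs_batch = torch.as_tensor(np.array(observations), dtype=torch.float32)
act_batch = torch.as_tensor(np.array(actions), dtype=torch.float32)

# treat the network outputs as the mean of a Gaussian policy and maximise the
# return-weighted log-probability of the actions that were actually taken
speed, steering = net(obs_batch)
mean = torch.cat([speed, steering], dim=1)
dist = torch.distributions.Normal(mean, 0.1)
log_prob = dist.log_prob(act_batch).sum(dim=1)

loss = -(log_prob * returns).mean()   # policy-gradient style "loss"
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

My understanding is that this is basically what the Karpathy post does, just with a Gaussian over two continuous outputs instead of a Bernoulli over up/down, but I'm not sure it's the right way to phrase the loss.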
It would be nice if someone could help me out here.
All the best, Tobs.