
I've been using the blackbox challenge (www.blackboxchallenge.com) to try to learn some reinforcement learning.

I've created a task and an environment for the challenge, and I'm using PyBrain to train on the black box environment. In summary, each state is a numpy ndarray of floating-point features, and there is a fixed number of actions. For the training level there are 36 features and 4 actions.

I've tried both the Q_LinFA and the QLambda_LinFA learners, but both have their coefficients (the ._theta array) overflow. During training the values start out fine and then rapidly increase until they are all NaN. I had a similar problem when I implemented Q-learning with a linear function approximator myself. I've also tried scaling the features down to [-1, 1], but that did not help.
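For reference, the scaling I mean is along these lines (a rough sketch; the feature_min / feature_max bounds are per-feature values estimated from the training data, not anything the challenge provides):

import numpy as np

def scale_features(state, feature_min, feature_max):
    # Linearly rescale each feature into [-1, 1]; feature_min / feature_max are
    # hypothetical per-feature bounds estimated beforehand from the training level.
    span = np.where(feature_max > feature_min, feature_max - feature_min, 1.0)
    scaled = 2.0 * (state - feature_min) / span - 1.0
    return np.clip(scaled, -1.0, 1.0)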

My code is below:

from bbox_environment import *
from bbox_task import *
import numpy as np
from pybrain.rl.learners.valuebased.linearfa import QLambda_LinFA
from pybrain.rl.learners.valuebased import ActionValueNetwork
from pybrain.rl.agents.linearfa import LinearFA_Agent
from pybrain.rl.experiments import EpisodicExperiment

# Wrap the black box level in my custom environment and task
test_env = bbox_environment("../levels/train_level.data")
test_task = bbox_task(test_env)
#test_controller = ActionValueNetwork(test_env.outdim, test_env.numActions)

# 4 actions, 36 features per state
learner = QLambda_LinFA(4, 36)
agent = LinearFA_Agent(learner)
experiment = EpisodicExperiment(test_task, agent)

num_episodes = 5

for episode in range(num_episodes):
    experiment.doEpisodes()
    agent.learn()
    agent.reset()
    print learner._theta

My intuition is that it might have something to do with these two runtime warnings, but I cannot figure it out. Can anyone help?

/usr/local/lib/python2.7/dist-packages/pybrain/rl/learners/valuebased/linearfa.py:81: RuntimeWarning: invalid value encountered in subtract
  tmp -= max(tmp)
/usr/local/lib/python2.7/dist-packages/pybrain/rl/learners/valuebased/linearfa.py:126: RuntimeWarning: invalid value encountered in double_scalars
  td_error = reward + self.rewardDiscount * max(dot(self._theta, next_state)) - dot(self._theta[action], state)

2 Answers


I had the same issue, with no decrease in the loss. Add something like the sum of squares of the thetas (an L2 penalty) to your td_error; it should fix the problem. More broadly, regularization is one of the central ideas in ML, so it is worth learning about.
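As a standalone sketch of the idea (the function and parameter names below are illustrative, not PyBrain's API; the penalty is part of the loss, so in practice it shows up as a shrinkage term in the weight update rather than being literally added to td_error):

import numpy as np

def regularized_q_update(theta, state, action, reward, next_state,
                         alpha=0.1, gamma=0.99, reg_lambda=0.01):
    # One Q-learning step for a linear approximator with an L2 penalty.
    # theta is a (num_actions, num_features) array like PyBrain's _theta;
    # the reg_lambda term shrinks the weights every step so they cannot blow up.
    td_error = (reward
                + gamma * np.max(np.dot(theta, next_state))
                - np.dot(theta[action], state))
    theta[action] += alpha * (td_error * state - reg_lambda * theta[action])
    return theta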

— Spoilt333

I am not familiar with the libraries you are using, but this kind of problem is usually due to a bad learning rate (the alpha parameter). I would recommend trying a learning rate that decreases over time, such as 1/t (with t the time step), or more generally one that satisfies condition (2.8) provided here.
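A sketch of the kind of schedule meant here (the names and the alpha0 value are just illustrative):

def learning_rate(t, alpha0=0.5):
    # 1/t schedule: the step sizes sum to infinity while their squares sum to a
    # finite value, the usual condition for the updates to remain stable and converge.
    return alpha0 / (1.0 + t)

# hypothetical usage: shrink the step size as the number of updates grows
for t in range(5):
    print learning_rate(t)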

— Hatim Khouzaimi