
I found the keras-rl/examples/cem_cartpole.py example and I would like to understand it, but I can't find any documentation.

What does the line

memory = EpisodeParameterMemory(limit=1000, window_length=1)

do? What are limit and window_length, and what effect does increasing either or both of them have?

Martin Thoma

2 Answers


EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
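To make that concrete, here is a rough sketch of the idea (with assumed names, not keras-rl's actual implementation): after each episode, CEM stores the policy-network parameters that were used together with the total reward that episode achieved, and later refits its sampling distribution to the best-performing parameter vectors.

import numpy as np

# Hypothetical sketch (not keras-rl's actual code): store one entry per episode,
# consisting of the policy parameters used and the episode's total reward.
episode_memory = []

def end_of_episode(policy_params, total_reward):
    episode_memory.append((np.copy(policy_params), total_reward))

# CEM then keeps the best-scoring parameter vectors and refits its
# sampling distribution to them before the next batch of episodes.
end_of_episode(np.random.randn(64), total_reward=200.0)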

Regarding your questions: the limit parameter simply specifies how many entries the memory can hold. Once this limit is exceeded, older entries are replaced by newer ones.
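A minimal sketch of what limit means (illustrative only, not the actual keras-rl implementation): the memory behaves like a ring buffer that holds at most limit entries and silently drops the oldest ones.

from collections import deque

# Illustrative bounded memory: once `limit` entries are stored,
# appending a new one silently drops the oldest.
class BoundedMemory:
    def __init__(self, limit):
        self.entries = deque(maxlen=limit)

    def append(self, entry):
        self.entries.append(entry)

mem = BoundedMemory(limit=1000)
for i in range(1500):
    mem.append(i)
print(len(mem.entries))  # 1000 -- the first 500 entries were replaced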

The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and is mostly there as a simple baseline). Typically, however, the window_length parameter controls how many consecutive observations are concatenated to form a single "state". This may be necessary if the environment is not fully observable (think of it as approximately transforming a POMDP into an MDP). DQN on Atari uses this, for example, since a single frame is clearly not enough to infer the velocity of a ball with a feed-forward network.
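To illustrate the general idea of window_length (with a hypothetical helper, not the keras-rl API): the last window_length observations are stacked into one state, which is what makes quantities like velocity inferable from otherwise static frames.

import numpy as np

def stack_observations(history, window_length):
    # Concatenate the most recent `window_length` observations into one state.
    recent = history[-window_length:]
    return np.stack(recent, axis=0)  # shape: (window_length, *obs_shape)

history = [np.random.rand(84, 84) for _ in range(10)]  # e.g. ten Atari-like frames
state = stack_observations(history, window_length=4)
print(state.shape)  # (4, 84, 84) -- enough context to infer motion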

Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception); it should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation, but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).

mplappert

A little late to the party, but I feel like the answer doesn't really answer the question.

I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):

We’ll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

Basically, you observe and save all of your state transitions so that you can train your network on them later (instead of having to interact with the environment to gather new observations all the time).
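As a rough sketch of that idea (assumed names, not any specific library's API): transitions go into a bounded buffer, and training batches are drawn from it at random so that consecutive, correlated samples don't end up in the same batch.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions fall out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling decorrelates the transitions within a batch.
        return random.sample(list(self.buffer), batch_size)

buffer = ReplayBuffer(capacity=1000)
for t in range(200):
    buffer.push(t, 0, 1.0, t + 1, False)  # dummy transitions
batch = buffer.sample(batch_size=32)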

Raven