Optimizing a deep Q-network with long episodes

Question

I am working on a problem that we aim to solve with deep Q-learning. However, training takes too long: each episode takes roughly 83 hours. We hope to solve the problem within, say, 100 episodes.

We are gradually learning a matrix (100 * 10), and within each episode we need to perform 100 * 10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding it the whole matrix as input.

The central hurdle is that computing the reward function at each step is costly, roughly 2 minutes, and each step updates only one entry in the matrix.
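As a rough check on these figures, the wall-clock time of one episode is the step count multiplied by the cost of one reward evaluation. The sketch below is only arithmetic; the function name is made up for illustration, and the 5-minute case is the pre-optimization cost mentioned in the comments under the first answer. At 2 minutes per reward an episode comes to about 33 hours, while the 83-hour figure corresponds to 5 minutes per reward.

```python
STEPS_PER_EPISODE = 100 * 10   # one matrix entry is filled per step
EPISODES = 100                 # the estimate given in the question


def episode_hours(seconds_per_reward: float) -> float:
    """Hours for one episode when every step calls the reward function once."""
    return STEPS_PER_EPISODE * seconds_per_reward / 3600


if __name__ == "__main__":
    for label, seconds in [("2 min per reward", 120), ("5 min per reward", 300)]:
        hours = episode_hours(seconds)
        days = hours * EPISODES / 24
        print(f"{label}: {hours:.1f} h per episode, {days:.0f} days for {EPISODES} episodes")
```

Either way, the reward evaluation dominates everything else, so any saving has to come from calling it less often or making each call cheaper.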

All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" system, if I understand correctly.

Could anyone shed some light on the potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be very much appreciated. Thanks.

======================= update of some definitions =================

0. initial stage:

a 100 * 10 matrix, with every element empty

1. action space:

At each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, so the matrix is filled one element at a time.

2. environment:

At each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).

This function F takes the matrix as input and performs a very costly computation, and it returns a quantitative value indicating the quality of the matrix synthesized so far.

This function is essentially used to measure some performance of a system, so it does take a while to compute a reward value at each step.

3. episode:

By saying "we are envisioning to solve it within 100 episodes", I mean that this is just an empirical estimate. But it shouldn't take fewer than 100 episodes, at least.

4. constraints

As I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as input rather than only the latest selected element.

Indeed, as more and more elements are appended to the matrix, the reward could increase, or it could decrease as well.

5. goal

The synthesized matrix should make the oracle function F return a value greater than 25000. As soon as it reaches this goal, I will terminate the learning.
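The definitions in items 0 to 5 can be sketched as a small environment class. This is only an illustration of the described setup, not the asker's code: `MatrixEnv` and `placeholder_F` are hypothetical names, the placeholder returns an arbitrary value in the stated 5000 to 30000 range instead of running the real 120-second measurement, and filling the matrix in row-major order is an assumption.

```python
ROWS, COLS = 100, 10   # matrix shape from the question
POOL_SIZE = 1000       # size of the candidate pool
GOAL = 25000           # the episode ends once F exceeds this value


def placeholder_F(matrix):
    """Hypothetical stand-in for the costly oracle F (the real one takes ~120 s)."""
    filled = [c for row in matrix for c in row if c is not None]
    return 5000.0 + (sum(filled) * 37) % 25001


class MatrixEnv:
    """One episode fills the matrix one slot per step (row-major order is assumed)."""

    def __init__(self, oracle=placeholder_F):
        self.oracle = oracle
        self.reset()

    def reset(self):
        self.matrix = [[None] * COLS for _ in range(ROWS)]  # item 0: all slots empty
        self.t = 0
        return self.matrix

    def step(self, candidate):
        if not 0 <= candidate < POOL_SIZE:
            raise ValueError("candidate index out of range")
        row, col = divmod(self.t, COLS)
        self.matrix[row][col] = candidate   # inserted candidates are never changed later
        self.t += 1
        score = self.oracle(self.matrix)    # F always sees the whole matrix
        done = score > GOAL or self.t == ROWS * COLS
        return self.matrix, score, done
```

Every call to `step` triggers exactly one evaluation of F, which is why the cost of an episode is governed by the oracle rather than by the learner.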

So far I have not received any comments or responses. Is it because the question itself is not clear? — lllllllllllll, May 20 '19 at 15:33
I have no idea how to solve your problem. But maybe you can clarify some concepts. It's not clear to me what your general RL setup is, I mean, what the environment is, your state space, your action space, the agent's goal, etc. You said that you are learning a 100*10 matrix, but what is the meaning of that matrix? How is it related to the RL problem? I guess a little extra context could be very useful for getting help from other users. — Pablo EM, May 20 '19 at 18:45
On the other hand, I'm curious why you expect to solve the problem in 100 episodes. Do you have experience with similar problems? Have you solved a simpler version of the same problem using deep RL? — Pablo EM, May 20 '19 at 18:45
The problem seems to be the reward calculation, how do you calculate it? Can't you use the previous result of the reward and only take into account the newly selected candidate to update that value? — agold, May 21 '19 at 05:51
@agold Then the problem no longer has a "long-term" reward, and there is no need for reinforcement learning, right? Could a greedy search solve it? — lllllllllllll, May 21 '19 at 07:27
A greedy search can work if your reward allows you to select the action that solves the problem (almost) directly. To help you more, you should explain your problem more thoroughly: what is your state space, actions, rewards, and how do you calculate this reward? — agold, May 22 '19 at 07:59
@PabloEM Hey, sorry for holding this for too long. I have updated the question with more information. I hope it makes the problem a bit clearer. Thank you! — lllllllllllll, May 27 '19 at 10:07
Thanks for the update! Some additional questions: 1) as far as I understand, the position of a candidate in the matrix is relevant, right? Having a candidate c in the first element of the matrix, say M[0,0], is not the same as having it in the last element, M[100, 10]. 2) If you put a candidate c1 in the matrix, is c1 still in the pool of candidates? And last but not least 3) is a sequence of states and actions required to achieve a specific state? I mean, can you put an arbitrary number of candidates into the matrix in one step? — Pablo EM, May 27 '19 at 14:09
@PabloEM Thank you for the inquiry. On 1: the relative position matters; that is, c1; c2; c3 is different from c2; c1; c3, but the absolute position does not matter; On 2: Yes, it is. On 3: no, I shouldn't do that. — lllllllllllll, May 27 '19 at 14:27
Regarding 3, are you only able to add a candidate in each step, or could you also modify the (relative) position of a candidate, or even remove a candidate? I'm just trying to understand the potential regularities of the problem. You said at least 100 episodes are required, but how many steps does an episode have? Under which conditions does an episode end? — Pablo EM, May 27 '19 at 14:42
Thank you! @PabloEM No, it cannot remove or modify the inserted candidates. We estimated the number of episodes from some empirical studies (e.g., using a 20*10 matrix). So an episode should have 100 * 10 steps. It ends when either 1) all 100 * 10 steps have finished, or 2) the returned reward score exceeds a predefined threshold, say, 25000 in our current setting. — lllllllllllll, May 27 '19 at 15:02
Thanks for all the details, but I'm afraid I don't know any magical solution for your problem. I was trying to determine if you really need RL or you can approach the problem using other techniques. But maybe you can't, not sure. After so many questions I feel compelled to write a response, although it's more an opinion than a solution. But I'll try it tomorrow if I have some time. Anyway, good luck with the new bounty! — Pablo EM, May 28 '19 at 21:04

score 3 · Answer 1 · answered May 25 '19 at 19:05

Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.

You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
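One way extra computing power might help without changing the reward function: the steps inside one episode are sequential, but separate episodes do not share a matrix, so several could run side by side and feed one replay buffer. The sketch below assumes that several independent instances of the reward function can run at once (for example on separate machines) without disturbing each other's measurements, which the question does not confirm. `placeholder_F`, `run_episode` and `collect` are hypothetical names, and a random policy stands in for the Q-network.

```python
import random
from concurrent.futures import ThreadPoolExecutor

STEPS = 100 * 10
POOL_SIZE = 1000


def placeholder_F(sequence):
    """Hypothetical stand-in for the costly reward function."""
    return 5000 + (sum(sequence) * 37) % 25001


def run_episode(seed, steps=STEPS):
    """One episode; its steps stay sequential because each reward depends on all earlier picks."""
    rng = random.Random(seed)
    sequence, transitions = [], []
    for t in range(steps):
        action = rng.randrange(POOL_SIZE)   # random policy in place of the Q-network
        sequence.append(action)
        transitions.append((t, action, placeholder_F(sequence)))
    return transitions


def collect(seeds, workers=4, steps=STEPS):
    """Run independent episodes concurrently and pool their transitions for the learner."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        episodes = list(pool.map(lambda s: run_episode(s, steps), seeds))
    return [t for episode in episodes for t in episode]
```

Threads are enough when the reward function mostly waits on an external process or remote system; if it is CPU-bound Python code, `ProcessPoolExecutor` from the same module would be the usual substitute.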

I would encourage you to share more details on the problem, for example by profiling the code to see which component takes the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
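For the profiling suggestion, Python's built-in `cProfile` and `pstats` modules are one option, assuming the training loop is written in Python. In this minimal sketch, `compute_reward`, `update_network` and `training_step` are hypothetical stand-ins that only sleep; in practice they would be replaced by the real reward computation and training step.

```python
import cProfile
import io
import pstats
import time


def compute_reward(matrix):
    """Hypothetical stand-in for the costly reward function."""
    time.sleep(0.05)
    return sum(sum(row) for row in matrix)


def update_network(reward):
    """Hypothetical stand-in for one Q-network update."""
    time.sleep(0.005)


def training_step(matrix):
    update_network(compute_reward(matrix))


def profile_steps(n_steps=3):
    """Profile a few training steps and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_steps):
        training_step([[1, 2], [3, 4]])
    profiler.disable()
    report = io.StringIO()
    pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(8)
    return report.getvalue()


if __name__ == "__main__":
    print(profile_steps())
```

The functions at the top of the cumulative-time listing show where the time goes, which is the information that would make concrete optimization advice possible.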

Thank you for the answer. I have updated the question with more information. Basically, the "reward function" works by measuring certain system performance, so it does take a while to compute one reward value at each step. — lllllllllllll, May 27 '19 at 10:08
We have tried to optimize that part for the past week, by changing each computation from 5 mins to 2 mins... But as you can see, it still takes a while... — lllllllllllll, May 27 '19 at 10:09

score 3 · Answer 2 · answered May 29 '19 at 15:54

Not a solution to your question, just some general thoughts that may be relevant:

One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected the equivalent of 900 years of experience per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't much better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., "Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning" or "Simple random search provides a competitive approach to reinforcement learning"). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network with millions of parameters usually implies that you'll need a huge quantity of data, or experience.
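To make the comparison with simpler methods concrete, here is a naive random-search baseline over candidate sequences with a fixed budget of reward evaluations. It is a deliberately simple illustration, not the method of either paper mentioned above. `placeholder_F` is a hypothetical stand-in for the costly oracle, and the sketch scores only complete sequences, ignoring the early termination described in the question.

```python
import random

POOL_SIZE = 1000


def placeholder_F(sequence):
    """Hypothetical stand-in for the costly oracle F."""
    return 5000 + (sum(sequence) * 37) % 25001


def random_search(length, budget, goal=25000, seed=0):
    """Spend at most `budget` evaluations of F on random sequences; return the best one."""
    rng = random.Random(seed)
    best_seq, best_score = None, float("-inf")
    for _ in range(budget):
        seq = [rng.randrange(POOL_SIZE) for _ in range(length)]
        score = placeholder_F(seq)          # one costly evaluation per sampled sequence
        if score > best_score:
            best_seq, best_score = seq, score
        if best_score > goal:               # same stopping rule as in the question
            break
    return best_seq, best_score
```

Since every step of an RL episode also costs one evaluation of F, a single 1000-step episode uses the same budget as 1000 random-search samples. Running such a baseline on the smaller 20*10 problem would show whether RL is buying anything for its cost.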

And regarding your specific question:

In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether you can use a shallow model (e.g., random trees). However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be for any number of reasons, and I perfectly understand.
You have estimated the number of episodes required to solve the problem based on some empirical studies using a smaller 20*10 matrix. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the dimensionality of the state space grows, although maybe that is not your case.
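A back-of-the-envelope count shows how quickly the space grows between the two matrix sizes. It assumes, as stated in the comments, that a candidate stays in the pool after being used and that order matters, so a fully filled matrix is one of 1000^(rows*cols) ordered sequences. This is only an upper bound on the problem's effective difficulty, and the function name is made up for illustration.

```python
import math

POOL_SIZE = 1000


def log10_num_sequences(rows, cols, pool=POOL_SIZE):
    """Base-10 logarithm of the number of ordered fillings when candidates may repeat."""
    return rows * cols * math.log10(pool)


if __name__ == "__main__":
    for rows, cols in [(20, 10), (100, 10)]:
        exponent = log10_num_sequences(rows, cols)
        print(f"{rows}*{cols} matrix: about 10^{exponent:.0f} possible sequences")
```

The 20*10 case gives roughly 10^600 sequences and the 100*10 case roughly 10^3000, so an episode count that worked on the small problem need not carry over to the large one.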

That said, I'm looking forward to seeing an answer that really helps you solve your problem.

I was going to post the same point about how exponential growth in a problem's complexity could explain the OP's issue. — hfontanez, Jun 04 '19 at 14:40
