
I am trying to build an RL model with deep Q-learning using RL4J in the AnyLogic PLE as part of my thesis. Unfortunately, I am not overly familiar with AnyLogic and DL4J and might therefore be missing some obvious steps. Since I only have access to the PLE, I am wondering what the best approach is to train an RL model in AnyLogic. All the examples I found online (Traffic light, Vehicle battery) either use custom experiments or export the project as a stand-alone application to train their RL model. Neither of these options is available in the PLE, so I tried to come up with a different approach.

A crucial part of the mentioned examples is the creation and destruction of the engine in the RL model's reset() function. I am not aware of a way to do the same in the PLE without stopping the simulation altogether. My basic idea for a workaround was to create a function in my main agent that resets the environment to its initial state as well as possible.
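
To make the idea more concrete, a simplified sketch of such a reset function could look like the following. All names are placeholders rather than my actual model, and the message-triggered reset transition is just one possible way to force the state chart back to its initial state:

// Function "resetExperiment" on Main - placeholder sketch, not my actual implementation.
stepCount = 0;       // step counter used for the reward calculation
is_done = false;     // episode-finished flag checked by the MDP's isDone()
// ... reset any other variables/collections the episode has modified ...
// Force the state chart back to its initial state, e.g. via a dedicated
// message-triggered transition leading back to it:
receive("reset");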

A bit more about my setup: I created a separate RL agent in AnyLogic which has one function containing all of the RL4J code. To train the model, this function is called from the main agent, which contains my environment and all the functions needed to interact with it (get observations, take actions, calculate rewards, and check whether an episode is done). On top of that, the main agent contains the aforementioned reset function, which resets the state chart (my environment) to its initial state, the step counter (used for the reward calculation), etc. Unfortunately, I haven't been able to get this running yet, because the state of the state chart doesn't seem to update after the RL agent takes an action. Hence, I can't tell whether my attempt is feasible at all or whether it could never work.
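
To illustrate the interaction, below is a sketch of how such an action function on Main could be wired to the state chart. This is only an illustration under assumptions: the message names and the message-triggered transitions are placeholders, not necessarily how my actual state chart is set up.

// Function "takeAction(int action)" on Main - placeholder sketch, not my actual implementation.
// Assumes the state chart's transitions are triggered by String messages.
stepCount++;                            // step counter used for the reward calculation
switch (action) {
    case 0: receive("actionA"); break;  // each message fires one transition in the state chart
    case 1: receive("actionB"); break;
    case 2: receive("actionC"); break;
    case 3: receive("actionD"); break;
}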

I wanted to ask whether my attempt will work once I figure out what is causing the current issue, or whether there is a better way of training an RL agent inside the AnyLogic PLE.

I don't know if it's of importance, but the code inside my RL agent's training function is:

// Anonymous MDP implementation that wraps the environment functions on my main agent.
MDP<Encodable, Integer, DiscreteSpace> mdp = new MDP<Encodable, Integer, DiscreteSpace>() {

    // 18-element observation vector (see main.getObservation()) and 4 discrete actions
    ObservationSpace<Encodable> observationSpace = new ArrayObservationSpace<>(new int[] {18});

    DiscreteSpace actionSpace = new DiscreteSpace(4);

    
    public ObservationSpace<Encodable> getObservationSpace() {
        return observationSpace;
    }

    
    public DiscreteSpace getActionSpace() {
        return actionSpace;
    }

    public Encodable getObservation(){
        System.out.println(Arrays.toString(main.getObservation()));
        return new Encodable() {
            double[] a = main.getObservation();

            public double[] toArray() {
                return a;
            }

            public boolean isSkipped() {
                return false;
            }

            public INDArray getData() {
                return null;
            }

            public Encodable dup() {
                return null;
            }
        };
    }

    
    public Encodable reset() {
        System.out.println("Reset");
        main.resetExperiment();
        
        return getObservation();
    }

    
    public void close() {
        System.out.println("Close");
    }

    
    public StepReply<Encodable> step(Integer action) {
        
        System.out.println("Took action: "+action);
        main.takeAction(action);

        double reward = main.calcReward();
        System.out.println("Reward: "+reward);
        return new StepReply<>(getObservation(), reward, isDone(), null);
    }

    
    public boolean isDone() {
        return main.is_done;

    }

    
    public MDP<Encodable, Integer, DiscreteSpace> newInstance() {
        // only needed to clone the MDP for parallel/asynchronous learners, so not implemented here
        return null;
    }
};

try {
    DataManager manager = new DataManager(true);

    QLearning.QLConfiguration AL_QL =
            new QLearning.QLConfiguration(
                    1,        // random seed
                    10000,    // max steps per epoch (episode)
                    100000,   // max total training steps
                    100000,   // max size of the experience replay buffer
                    128,      // batch size
                    1000,     // target network update frequency (hard update)
                    10,       // number of no-op warmup steps
                    1,        // reward scaling factor
                    0.99,     // gamma (discount factor)
                    1.0,      // TD-error clipping
                    0.1f,     // minimum epsilon
                    30000,    // number of steps over which epsilon is annealed
                    true      // use double DQN
            );

    // dense network: 2 hidden layers with 300 nodes each, RMSProp(0.001), no L2 regularization
    DQNFactoryStdDense.Configuration AL_NET =
            DQNFactoryStdDense.Configuration.builder()
                    .l2(0).updater(new RmsProp(0.001)).numHiddenNodes(300).numLayer(2).build();

    QLearningDiscreteDense<Encodable> dql = new QLearningDiscreteDense<>(mdp, AL_NET, AL_QL, manager);
            
    dql.train();

    DQNPolicy<Encodable> pol = dql.getPolicy();
    pol.save("Statechart.zip");

    mdp.close();
} catch (IOException e) {
    e.printStackTrace();
}
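
For completeness, this training function is simply called from Main, e.g. in Main's "On startup" code or from a button (rlAgent and train are placeholder names for the embedded RL agent and its function):

rlAgent.train();

The classes used above come from RL4J/DL4J and are added to the RL agent's Imports section (Advanced Java). Roughly the following imports are needed, although the exact packages may differ slightly between RL4J versions:

import java.io.IOException;
import java.util.Arrays;
import org.deeplearning4j.gym.StepReply;
import org.deeplearning4j.rl4j.learning.sync.qlearning.QLearning;
import org.deeplearning4j.rl4j.learning.sync.qlearning.discrete.QLearningDiscreteDense;
import org.deeplearning4j.rl4j.mdp.MDP;
import org.deeplearning4j.rl4j.network.dqn.DQNFactoryStdDense;
import org.deeplearning4j.rl4j.policy.DQNPolicy;
import org.deeplearning4j.rl4j.space.ArrayObservationSpace;
import org.deeplearning4j.rl4j.space.DiscreteSpace;
import org.deeplearning4j.rl4j.space.Encodable;
import org.deeplearning4j.rl4j.space.ObservationSpace;
import org.deeplearning4j.rl4j.util.DataManager;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.learning.config.RmsProp;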

If you need any further information, please let me know.

Looking forward to any suggestions and thank you!

Daniel
  • I always thought it was impossible to use RL4J with the PLE, but you might have something here and this is in fact theoretically possible... but remember the PLE allows only 50,000 agents, so you will have to recycle your agents, because for RL to work you may need to run the simulation millions of times, depending on your situation... – Felipe Dec 04 '20 at 18:54
  • I thought about looping my agents and by that recycling them to avoid that limitation. Are there any potential issues you can think of which I could run into with my RL4J work-around? – Daniel Dec 04 '20 at 20:19
  • well, your training will be extremely slow if you do it on your own computer; if your problem is complicated, you may end up spending months of computer time just running trainings and shaping the reward function, and 12 hours might not be enough, so running overnight might be insufficient (depending on your computer)... – Felipe Dec 04 '20 at 20:32
  • I'll be able to train the model on a server cluster at my university, so hopefully it won't take that long. Thank you for your feedback! – Daniel Dec 05 '20 at 08:14
  • I changed the vehicle battery examples to train the RL model inside a train function in my Anylogic main. The training process itself runs without any errors. The `takeAction()` and `getObservation()` functions get called appropriately. However, for some reason the `takeAction()` doesn't cause a state change in the state chart. When calling it manually it works fine. @Felipe do you have an idea what could cause this problem? if I'm supposed to open a new question for this, please let me know. I am unaware of what the stack overflow etiquette asks me to do for this situation – Daniel Dec 06 '20 at 10:47
  • open a new question whenever you have a new different question :P – Felipe Dec 06 '20 at 10:53
  • Thanks for the tip! I did so here: https://stackoverflow.com/questions/65167687/what-could-cause-a-state-chart-not-to-update-while-training-a-rl4j-model-inside – Daniel Dec 06 '20 at 11:50
