
Environment: There are 25 total turns. There are two types of actions: build CS (a construction site) and build CI (a building, which is what I want to maximize).

Goal: Find the maximum number of CIs (buildings) that can be built within the total number of turns, specifically using machine learning / reinforcement learning.

Note: Even though CS are technically buildings, I am not including them in the total building count. This is important when reading my code: "buildings" refers only to the CIs built.

Formula: BPT (buildings per turn) = floor(CS/4) + 5. For every 4 CS built, your BPT increases by 1 (you start at 5).

For example:
turn 1: build 5 CI (bpt: 5) (total buildings: 5)
turn 2: build 1 CS (bpt: 5) (total buildings: 5)
turn 3: build 1 CS (bpt: 5) (total buildings: 5)
turn 4: build 1 CS (bpt: 5) (total buildings: 5) 
turn 5: build 1 CS (bpt: 6) (total buildings: 5)
turn 6: build 6 CI (bpt: 6) (total buildings: 11) (increased by BPT 6)
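
Just to make the bookkeeping concrete, here is a minimal standalone replay of the six turns above (illustrative only, separate from my Q-learning code further down):

def bpt(cs):
    # buildings per turn: floor(CS / 4) + 5
    return cs // 4 + 5

cs, buildings = 0, 0
plan = ["ci", "cs", "cs", "cs", "cs", "ci"]  # the six turns listed above
for turn, action in enumerate(plan, start=1):
    if action == "ci":
        buildings += bpt(cs)
    else:
        cs += 1
    print(f"turn {turn}: bpt={bpt(cs)}, total buildings={buildings}")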

My overall goal is to reach turn 25 and find the maximum number of CIs that can be built. In addition, I want to know which actions to take, and in what order, to achieve that maximum.

My code below seems to achieve that, but it fails when I attempt to use my trained model. My understanding is that, once all episodes are completed, the q_values table should map out the best possible path.

Unfortunately, what is happening is that my final q_values table appears to have all the same values, so np.argmax simply selects the 0th index for every decision. What I have noticed is that during training my model correctly identifies the best solution, but for some reason my final q_values table doesn't reflect it.
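
For reference, np.argmax breaks ties by returning the first index, which is why a row of identical Q-values always comes back as action 0:

import numpy as np

row = np.zeros(2)       # two identical q-values for one state
print(np.argmax(row))   # prints 0 -- argmax returns the first index on ties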

One important note: at turn 25, the maximum number of buildings should be 126 if played correctly: the first 4 turns should be CS and the remaining 21 turns should be CIs (21 turns × 6 BPT = 126).
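
As a sanity check on that number, here is a small exhaustive search over (turn, CS) states, independent of the Q-learning code below, which confirms that 126 is the optimum:

from functools import lru_cache

TURNS = 25

def bpt(cs):
    # buildings per turn: floor(CS / 4) + 5
    return cs // 4 + 5

@lru_cache(maxsize=None)
def best(turn, cs):
    # maximum CIs still buildable from this turn onward, holding `cs` construction sites
    if turn == TURNS:
        return 0
    build_ci = bpt(cs) + best(turn + 1, cs)  # spend this turn building CIs
    build_cs = best(turn + 1, cs + 1)        # spend this turn building one CS
    return max(build_ci, build_cs)

print(best(0, 0))  # prints 126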

import numpy as np
import math
import pdb



class AI:

    def __init__(self, turns: int, learning_rate: float, discount_factor: float, actions: list, q_values: np.ndarray):
        '''
        turns: max number of turns an agent can take,
        learning_rate: the rate at which the agent updates its Q-value estimates,
        discount_factor: how strongly future rewards are discounted,
        actions: the actions which the agent can take,
        q_values: a table of estimated action values which suggests which action should be taken at any given state

        history_cs: state - number of cs built
        history_ci: state - number of ci built (buildings)
        '''

        # default values
        self.state = 0
        self.cs = 0
        self.buildings = 0
        self.max_buildings = 0
        self.history_cs = []
        self.history_ci = []

        self.turns = turns
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.actions = actions
        self.q_values = q_values


    def reset(self):
        ''' Resets the default values back to their original values '''
        self.state = 0
        self.cs = 0
        self.buildings = 0
        self.history_cs = []
        self.history_ci = []


    def get_reward(self) -> int:
        ''' The reward will be based on the number of buildings created '''
        return self.buildings 

    def is_game_over(self) -> bool:
        ''' Determines if all turns have been used '''
        return self.state == self.turns


    def get_bpt(self, cs: int) -> int:
        ''' Determines the current buildings per turn '''
        return (math.floor(cs/4)) + 5


    def get_next_action(self, epsilon: float) -> int:
        '''
        Epsilon-greedy selection: with probability epsilon the current best (greedy) action is returned;
        otherwise a random action is chosen, so an inferior action may still be taken occasionally.
        '''
        if np.random.random() < epsilon:
            return np.argmax(self.q_values[self.state])
        else:
            return np.random.randint(2)


    def get_next_state(self, action_index: int) -> int:
        ''' Executes next action and returns the next state '''
        if self.actions[action_index] == "build ci":
            new_buildings = self.get_bpt(self.cs)
            self.buildings += new_buildings
            self.history_ci.append({self.state: new_buildings})

        elif self.actions[action_index] == "build cs":
            self.cs += 1
            self.history_cs.append({self.state : 1})
 
        self.state += 1
        return self.state

    def print_best_path(self):
        self.reset()
        while not self.is_game_over():
            action_index = self.get_next_action(1.)
            if action_index == 0:
                print("build ci")
            else:
                print("build cs")
            self.get_next_state(action_index)
        print(f"total construction sites: {self.cs}")
        print(f"total buildings: {self.buildings}")

      
TURNS = 25

ai = AI(turns=TURNS,
        learning_rate=0.9,
        discount_factor=0.9,
        actions=["build ci", "build cs"],
        q_values=np.zeros((TURNS+1, 1, 2)))


for episode in range(100000):

    ai.reset()

    action_index = None

    while not ai.is_game_over():
        action_index = ai.get_next_action(.9)
        old_state = ai.state
        next_state = ai.get_next_state(action_index) 
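        # per-step reward: -10 while the running building count is below the best episode total seen so far, else -1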
        if ai.buildings < ai.max_buildings:
            reward = -10
        else:
            reward = -1 

        old_q_value = ai.q_values[old_state, 0, action_index]
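        # standard Q-learning update: Q(s, a) <- Q(s, a) + lr * (reward + gamma * max_a' Q(s', a') - Q(s, a))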
        temporal_difference = reward + (ai.discount_factor * np.max(ai.q_values[next_state])) - old_q_value
        new_q_value = old_q_value + (ai.learning_rate * temporal_difference)
        ai.q_values[old_state, 0, action_index] = new_q_value

    if ai.buildings > ai.max_buildings:
        ai.max_buildings = ai.buildings
        print(f"\nepisode: {episode}")
        print(ai.history_cs)
        print(ai.history_ci)
        print(f"total construction sites: {ai.cs}")
        print(f"total buildings: {ai.buildings}")
        #if ai.buildings == 126:
        #    print(ai.q_values)


    #pdb.set_trace()

#ai.print_best_path()
