
So the last thread I made about Reinforcement Learning was marked as too broad, which I totally understood. I've never worked with it before, so I'm trying to learn it on my own - not an easy task so far. Now, I've been reading some papers and tried to base my approach on what I've learned from them, but I'm not sure if what I'm doing makes sense so I'd appreciate some help here!

Basically I want to calculate how much to order each day using Q-learning. Below is the relevant part of the code. Ip and Im were already computed earlier in the code (by simulating each day's order without reinforcement learning) so I can feed them into my algorithm and train it.

I've divided my states into 9 (depending on how much stock I have) and my actions into 9 (each one meaning I should order a certain quantity). My reward function is my objective function, which is my total cost that I want to minimize (so it's really a "loss" function). However, this isn't really getting optimized, I presume because the Q matrix isn't being trained properly; it seems to be too random. Any thoughts on how to improve/fix this code?
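For reference, the 9-way state binning I use can be written compactly; `np.digitize` with `right=True` reproduces the same thresholds used in the code below:

```python
import numpy as np

# Thresholds between the 9 inventory states: <=8 -> 0, 9-14 -> 1, ..., >50 -> 8
BINS = [8, 14, 20, 26, 32, 38, 44, 50]

def get_state(ip):
    # right=True makes each bin edge inclusive on the right, matching the <= tests
    return int(np.digitize(ip, BINS, right=True))

print(get_state(8), get_state(9), get_state(50), get_state(51))  # 0 1 7 8
```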

# Training

import numpy as np
import random

# Ip - on-hand inventory
# Im - lost orders
# T  - no. of days (360)
# Qt - order quantity
# h  - holding cost per unit, b - lost-order penalty per unit (defined in the full code)

def reward(t):
    # daily cost: holding cost on stock plus penalty on lost orders (a cost to minimize)
    return h*Ip[t] + b*Im[t]

Q = np.matrix(np.zeros([9,9])) # Q-table: 9 states x 9 actions

iteration = 0
t = 0
MAX_ITERATION = 500
alp = 0.2 # learning rate (between 0 and 1)
exploitation_p = 0.15 # exploitation probability (increased after each iteration until it reaches 1)

while iteration <= MAX_ITERATION:
    while t < T-1:
        # bin on-hand inventory into one of 9 states (<=8, 9-14, ..., 45-50, >50)
        state = int(np.digitize(Ip[t], [8, 14, 20, 26, 32, 38, 44, 50], right=True))

        rd = random.random()
        if rd < exploitation_p:
            # exploit: greedy action, ties broken at random
            action = np.where(Q[state,] == np.max(Q[state,]))[1]
            if np.size(action) > 1:
                action = np.random.choice(action, 1)
        else:
            # explore: any of the 9 actions, uniformly at random
            action = np.random.choice(9, 1)
        action = int(action)
        rew = reward(t+1)

        # bin the next day's on-hand inventory the same way
        next_state = int(np.digitize(Ip[t+1], [8, 14, 20, 26, 32, 38, 44, 50], right=True))

        # value of the greedy action in the next state (ties broken at random)
        next_action = np.where(Q[next_state,] == np.max(Q[next_state,]))[1]
        if np.size(next_action) > 1:
            next_action = np.random.choice(next_action, 1)
        next_action = int(next_action)

        # Q-learning update; the reward is negated because reward() is a cost
        # (note: no discount factor here, i.e. gamma = 1)
        Q[state, action] = Q[state, action] + alp*(-rew + Q[next_state, next_action] - Q[state, action])

        t += 1
    if (exploitation_p < 1):
        exploitation_p = exploitation_p + 0.05
    t = 0
    iteration += 1
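For what it's worth, here is the shape of tabular Q-learning as the papers I've read usually present it, with a discount factor and with the chosen action actually determining the next state. The `step()` environment below is a made-up toy, not my inventory model — it's only there to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 9, 9
alpha, gamma, eps = 0.2, 0.95, 1.0   # learning rate, discount, exploration prob
Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    # Hypothetical environment: the chosen action must influence the next state
    # and the cost.  This is a toy stand-in, NOT a real inventory model.
    next_state = int(np.clip(state + action - 4 + rng.integers(-1, 2), 0, N_STATES - 1))
    cost = abs(next_state - 4)       # pretend cost is distance from a target stock level
    return next_state, cost

for episode in range(200):
    state = int(rng.integers(N_STATES))
    for t in range(50):
        # epsilon-greedy: explore with probability eps, otherwise act greedily
        if rng.random() < eps:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, cost = step(state, action)
        # standard Q-learning update (cost negated so "max" means "cheapest")
        Q[state, action] += alpha * (-cost + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
    eps = max(0.05, eps * 0.99)      # decay exploration over episodes
```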

# Testing

Ip = [0] * T
Im = [0] * T

It[0] = I0 - d[0]

if (It[0] >= 0):
    Ip[0] = It[0]
else:
    Im[0] = -It[0]

Qt[0] = 0
Qbase = 100

sumIp = Ip[0]
sumIm = Im[0]

i = 1
while i < T:
    if (i - LT >= 0):
        It[i] = Ip[i-1] - d[i] + Qt[i-LT]
    else:
        It[i] = Ip[i-1] - d[i]
    It[i] = round(It[i], 0)
    if It[i] >= 0:
        Ip[i] = It[i]
    else:
        Im[i] = -It[i]

    # bin on-hand inventory into one of 9 states (same thresholds as in training)
    state = int(np.digitize(Ip[i], [8, 14, 20, 26, 32, 38, 44, 50], right=True))

    action = np.where(Q[state,] == np.max(Q[state,]))[1]
    if np.size(action) > 1:
        action = np.random.choice(action,1)
    action = int(action)

    # map action index to an order quantity: 100%, 95%, ..., 60% of Qbase
    multipliers = (1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6)
    Qt[i] = Qbase * multipliers[action]

    sumIp = sumIp + Ip[i]
    sumIm = sumIm + Im[i]

    i += 1

objfunc = h*sumIp+b*sumIm

print(objfunc)
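(Side note: the running sums are equivalent to summing once at the end; with numpy arrays the objective is a one-liner. The numbers here are toy stand-ins, not my real h, b, Ip, Im:)

```python
import numpy as np

h, b = 1.0, 5.0                      # toy holding / lost-order costs (assumption)
Ip = np.array([3, 0, 7, 2])          # toy on-hand inventory series
Im = np.array([0, 4, 0, 1])          # toy lost-order series

objfunc = h * Ip.sum() + b * Im.sum()
print(objfunc)  # 37.0
```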

In case you want/need to run it, this is my full code: https://pastebin.com/vU5V0ehg

Thanks in advance!

P.S. I guess my MAX_ITERATION should be higher (most papers seem to use 10000), but my computer takes too long to run the program in that case, which is why I'm using 500.

Sergio
  • @rui-nian sorry for tagging you, you don't have to answer if you don't want to... you commented on my last post about RL saying I can use it to calculate which orders to place depending on my stock... any idea what I am doing wrong here? – Sergio Jan 22 '19 at 14:59
