
In a VBasedPolicy, the neural network approximator tries to learn the V values of states. So its first (input) layer should have the same number of neurons as the size of the state, and I believe its last (output) layer should have a size of 1, since it produces a single scalar value for the input state.

But when I use such a network architecture, I get a BoundsError from the trajectory update. The details are as follows.

I looked at the example experiments in the library's GitHub repo, but all BasicDQN examples use a QBasedPolicy, where the network's last layer has a size equal to the number of actions. That makes sense to me, because given a state the network has to output one Q value per action.
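
To make the difference concrete, here is a minimal sketch of the two output heads I mean (my own illustration, not code from the library's examples):

using Flux

STATE_SIZE, ACTION_SIZE = 2, 7

# Q-network head (as in the BasicDQN examples): one output per action
q_model = Chain(Dense(STATE_SIZE, 128, relu), Dense(128, ACTION_SIZE))
size(q_model(rand(Float32, STATE_SIZE, 10)))  # (7, 10): one Q value per action per sample

# V-network head (what I am trying to use): a single value per state
v_model = Chain(Dense(STATE_SIZE, 128, relu), Dense(128, 1))
size(v_model(rand(Float32, STATE_SIZE, 10)))  # (1, 10): one V value per sample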

I went through the code on GitHub; the error comes precisely from line 79 of basic_dqn.jl (see the stack trace below), but I couldn't solve it.
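
From the stack trace, that line indexes the model's batched output with CartesianIndex(action, batch) pairs, which only works when the output has one row per action. A minimal reproduction of the failing indexing (my own sketch, not the library's code):

# The V-model outputs a 1×batch matrix (one value per sample) ...
q_out = rand(Float32, 1, 50)
# ... but the learner indexes it with (action, batch) pairs, where the action index can be up to 7
actions = rand(1:7, 50)
inds = [CartesianIndex(a, i) for (i, a) in enumerate(actions)]
q_out[inds]  # BoundsError whenever any action index is greater than 1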

Code for the single agent policy (VBasedPolicy):

STATE_SIZE = length(env.channels) # 2
ACTION_SIZE = length(action_set)    # 7    
model = Chain(
            Dense(STATE_SIZE, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 128, relu),
            Dense(128, 1)
        ) |> cpu

# optimizer 
η = 1f-3 # Learning rate
η_decay = 1f-4
opt = Flux.Optimiser(ADAM(η), InvDecay(η_decay))

function value_action_mapping(env, value_learner; explorer = EpsilonGreedyExplorer(0.4))
    A = legal_action_space(env)
    println("legal action space: ", A)
    V = map(A) do a
        value_learner(child(env, a))
    end
    println("V values: ", V)
    c = A[explorer(V)]
    println("Chosen action: ", c)
    println("Action with max V val: ", findmax(V))
    return c
end

single_agent_policy = Agent(
            policy = VBasedPolicy(;
                    learner = BasicDQNLearner(;
                        approximator = NeuralNetworkApproximator(;
                            model = model,
                            optimizer = opt
                        ),
                        min_replay_history = 50,
                        batch_size = 50,
                        γ = 0.99
                    ),
                    mapping = value_action_mapping
                ),
                trajectory = CircularArraySARTTrajectory(;
                            capacity = 100,
                            state=Array{Float64} => (STATE_SIZE)
                        )
                )

Here is the error:
BoundsError: attempt to access 1×50 Matrix{Float32} at index [CartesianIndex{2}[CartesianIndex(3, 1), CartesianIndex(1, 2), CartesianIndex(3, 3), CartesianIndex(4, 4), CartesianIndex(5, 5), CartesianIndex(4, 6), CartesianIndex(4, 7), CartesianIndex(3, 8), CartesianIndex(4, 9), CartesianIndex(4, 10)  …  CartesianIndex(5, 41), CartesianIndex(4, 42), CartesianIndex(3, 43), CartesianIndex(4, 44), CartesianIndex(5, 45), CartesianIndex(5, 46), CartesianIndex(6, 47), CartesianIndex(4, 48), CartesianIndex(4, 49), CartesianIndex(1, 50)]]

Stacktrace:
  [1] throw_boundserror(A::Matrix{Float32}, I::Tuple{Vector{CartesianIndex{2}}})
    @ Base .\abstractarray.jl:651
  [2] checkbounds
    @ .\abstractarray.jl:616 [inlined]
  [3] _getindex
    @ .\multidimensional.jl:831 [inlined]
  [4] getindex
    @ .\abstractarray.jl:1170 [inlined]
  [5] adjoint
    @ C:\Users\vchou\.julia\packages\Zygote\i1R8y\src\lib\array.jl:31 [inlined]
  [6] _pullback(__context__::Zygote.Context, 496::typeof(getindex), x::Matrix{Float32}, inds::Vector{CartesianIndex{2}})
    @ Zygote C:\Users\vchou\.julia\packages\ZygoteRules\OjfTt\src\adjoint.jl:57
  [7] _pullback
    @ C:\Users\vchou\.julia\packages\ReinforcementLearningZoo\M308M\src\algorithms\dqns\basic_dqn.jl:79 [inlined]
  [8] _pullback(::Zygote.Context, ::ReinforcementLearningZoo.var"#52#54"{BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, Matrix{Float64}, Vector{Bool}, Vector{Float32}, Matrix{Float64}, typeof(Flux.Losses.huber_loss), Float32, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}})
    @ Zygote C:\Users\vchou\.julia\packages\Zygote\i1R8y\src\compiler\interface2.jl:0
  [9] pullback(f::Function, ps::Zygote.Params)
    @ Zygote C:\Users\vchou\.julia\packages\Zygote\i1R8y\src\compiler\interface.jl:250
 [10] gradient(f::Function, args::Zygote.Params)
    @ Zygote C:\Users\vchou\.julia\packages\Zygote\i1R8y\src\compiler\interface.jl:58
 [11] update!(learner::BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, batch::NamedTuple{(:state, :action, :reward, :terminal, :next_state), Tuple{Matrix{Float64}, Vector{Int64}, Vector{Float32}, Vector{Bool}, Matrix{Float64}}})
    @ ReinforcementLearningZoo C:\Users\vchou\.julia\packages\ReinforcementLearningZoo\M308M\src\algorithms\dqns\basic_dqn.jl:78
 [12] update!(learner::BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, traj::CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float64, 2}, CircularArrayBuffers.CircularVectorBuffer{Int64}, CircularArrayBuffers.CircularVectorBuffer{Float32}, CircularArrayBuffers.CircularVectorBuffer{Bool}}}})
    @ ReinforcementLearningZoo C:\Users\vchou\.julia\packages\ReinforcementLearningZoo\M308M\src\algorithms\dqns\basic_dqn.jl:65
 [13] update!
    @ C:\Users\vchou\.julia\packages\ReinforcementLearningCore\NWrFY\src\policies\q_based_policies\learners\abstract_learner.jl:35 [inlined]
 [14] update!
    @ C:\Users\vchou\.julia\packages\ReinforcementLearningCore\NWrFY\src\policies\v_based_policies.jl:31 [inlined]
 [15] (::Agent{VBasedPolicy{BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, typeof(value_action_mapping)}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float64, 2}, CircularArrayBuffers.CircularVectorBuffer{Int64}, CircularArrayBuffers.CircularVectorBuffer{Float32}, CircularArrayBuffers.CircularVectorBuffer{Bool}}}}})(stage::PreActStage, env::AdSpendEnv, action::Int64)
    @ ReinforcementLearningCore C:\Users\vchou\.julia\packages\ReinforcementLearningCore\NWrFY\src\policies\agents\agent.jl:74
 [16] _run(policy::Agent{VBasedPolicy{BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, typeof(value_action_mapping)}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float64, 2}, CircularArrayBuffers.CircularVectorBuffer{Int64}, CircularArrayBuffers.CircularVectorBuffer{Float32}, CircularArrayBuffers.CircularVectorBuffer{Bool}}}}}, env::AdSpendEnv, stop_condition::StopAfterEpisode{ProgressMeter.Progress}, hook::ComposedHook{Tuple{RewardPerStep, ActionsPerStep, TotalRewardPerEpisode, NeuralOutputPerStep, StatePerStep}})
    @ ReinforcementLearningCore C:\Users\vchou\.julia\packages\ReinforcementLearningCore\NWrFY\src\core\run.jl:28
 [17] run(policy::Agent{VBasedPolicy{BasicDQNLearner{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Optimise.Optimiser}, typeof(Flux.Losses.huber_loss), Random._GLOBAL_RNG}, typeof(value_action_mapping)}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float64, 2}, CircularArrayBuffers.CircularVectorBuffer{Int64}, CircularArrayBuffers.CircularVectorBuffer{Float32}, CircularArrayBuffers.CircularVectorBuffer{Bool}}}}}, env::AdSpendEnv, stop_condition::StopAfterEpisode{ProgressMeter.Progress}, hook::ComposedHook{Tuple{RewardPerStep, ActionsPerStep, TotalRewardPerEpisode, NeuralOutputPerStep, StatePerStep}})
    @ ReinforcementLearningCore C:\Users\vchou\.julia\packages\ReinforcementLearningCore\NWrFY\src\core\run.jl:10
 [18] top-level scope
    @ In[1211]:2
 [19] eval
    @ .\boot.jl:360 [inlined]
 [20] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1094
  • The output of your `model` is a vector of length `1` instead of a scalar. I guess this is the main reason. By the way, I think your model is way too large for such a simple state of size 2. You can find some usages of `VBasedPolicy` here: https://github.com/JuliaReinforcementLearning/ReinforcementLearningAnIntroduction.jl/search?q=VBasedPolicy – Jun Tian Jun 15 '21 at 09:59
  • Yeah, the model is large, but it will be scaled up to thousands of channels in the future. – KnownUnknown Jun 15 '21 at 13:38
  • Can you tell me how to convert the output to a scalar instead of a length-1 vector? – KnownUnknown Jun 15 '21 at 14:03
  • In the source code for updating the BasicDQN with batches, it uses Q(s), where Q is the approximator and s is a batch of states, to get the V values. We want this to have dims (batch_size,). So I believe we need to add some function after the last layer that extracts the first index of the output (see the sketch after these comments), but I can't find one :( – KnownUnknown Jun 15 '21 at 14:51
  • Yes, that sounds reasonable. But as I showed in the above link, I have only used VBasedPolicy with some simple tabular methods before. So if you want to adapt it to Flux models, you may need a bit of work here. First, `BasicDQNLearner` is not appropriate here; it is meant to be used in a `QBasedPolicy` only. This means you need to write your own learner to decide how to update the learner and the trajectory. An example can be found at https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningZoo/src/algorithms/tabular/monte_carlo_learner.jl – Jun Tian Jun 15 '21 at 15:21
  • I understand it's not easy for new users. But I'd be happy to provide help if you can create an issue and describe what kind of problem you're trying to solve. – Jun Tian Jun 15 '21 at 15:23
  • I am trying to solve a budget allocation problem using RL: we have a budget that needs to be distributed over a certain number of advertising channels over a period of, say, n weeks. I have formulated this as an MDP as follows: state: [spend_ch_1, spend_ch_2, ...]; action: [increase_spend_ch_1, decrease_spend_ch_2, ...] (this is one of many possible actions). The transition function just assigns the new state by applying the action to the current spend. We have constraints such as the spend for each channel staying within some range, the overall budget not being exceeded, etc. The goal is to maximize sales. – KnownUnknown Jun 15 '21 at 15:53
  • I kept the model so complex because it was not able to learn that it should not exceed the budget. I was giving it a negative reward if it exceeded the overall budget. – KnownUnknown Jun 15 '21 at 15:59
  • I see. I'd prefer to use A2C on this problem first. The output of the critic part of the model should give you an estimate of the state value if you really need it. – Jun Tian Jun 16 '21 at 02:51
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233831/discussion-between-knownunknown-and-jun-tian). – KnownUnknown Jun 16 '21 at 08:43
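
Regarding the comment above about getting an output with dims (batch_size,): appending vec to the Chain flattens the 1×batch output into a plain vector. This is only a sketch of that idea; it does not by itself make BasicDQNLearner work with a V-network, since the learner still indexes the output by action.

using Flux

STATE_SIZE = 2
# Hypothetical V-model whose batched output is a Vector of length batch_size
v_model = Chain(
    Dense(STATE_SIZE, 128, relu),
    Dense(128, 1),
    vec,                        # (1, batch) matrix -> (batch,) vector
)

batch = rand(Float32, STATE_SIZE, 50)
size(v_model(batch))            # (50,)
only(v_model(rand(Float32, STATE_SIZE)))  # a plain scalar for a single state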

0 Answers