I am implementing a Reinforcement Learning agent that takes actions given a time series of prices. The actions are, classically, buy, sell, or wait. The neural network gets one batch at a time as input; the window size is 96 steps and I have around 80 features, so the input is something like 1x96x80. The algorithm is online and, every 96 new observations, takes a random sample from a replay memory that stores the last 480 transitions (s, a, r, s'). I give a reward signal for each action at each timestep, where the reward for buy is +1, the one for sell is -1, and so on, so I do not have to bother about exploration. I am using the standard way of calculating the loss (as in DeepMind's original DQN paper), with two networks: one estimates the Q values, the other acts as a target and gets a soft update every step.
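To be explicit about the update I use, the training step looks roughly like this (a simplified sketch: names like q_net, target_net, gamma and tau are placeholders, and the real code has more bookkeeping):

    import tensorflow as tf

    def train_step(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.01):
        # batch of transitions sampled from the replay memory
        states, actions, rewards, next_states = batch   # (B, 96, 80), (B,), (B,), (B, 96, 80)
        rewards = tf.cast(rewards, tf.float32)
        actions = tf.cast(actions, tf.int32)

        # standard DQN target: r + gamma * max_a' Q_target(s', a')
        next_q = target_net(next_states)                 # (B, 3)
        targets = rewards + gamma * tf.reduce_max(next_q, axis=1)

        with tf.GradientTape() as tape:
            q_values = q_net(states)                     # (B, 3)
            action_mask = tf.one_hot(actions, depth=3)
            chosen_q = tf.reduce_sum(q_values * action_mask, axis=1)
            loss = tf.reduce_mean(tf.square(tf.stop_gradient(targets) - chosen_q))

        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

        # soft (Polyak) update of the target network, done every step
        for w_t, w in zip(target_net.weights, q_net.weights):
            w_t.assign(tau * w + (1.0 - tau) * w_t)
        return loss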
At each step the agent picks the action with the highest estimated Q value, as shown in the graphs below. My issue is with how the Q values behave depending on the architecture of the model. In the standard case I have two dense 'elu' layers, two LSTM layers, and a final dense 'linear' layer with 3 units. With this configuration the Q values fluctuate too heavily: I get a new maximum almost every step, and the greedy policy switches actions too frequently, incurring high transaction costs that destroy the performance (fig. 1). In the other case (fig. 2) I simply add another dense linear layer of 3 units before the last one. The Q values then evolve much more slowly, which improves performance because the agent doesn't incur high costs, but the tradeoff is a much less efficient learner that is slow to adapt to new conditions and keeps picking suboptimal actions for longer, which also hurts performance (though it is still way better than before). For completeness, I tried both having the LSTM return the whole sequence and updating the gradient on that, and using only the last step. There is no real difference between the two.
Fig. 1: without a second linear layer. Fig. 2: with a double linear layer. The blue line is sell (and keep the short position), the orange is wait (or close the position), and the green is buy (or keep the long position). A position is therefore kept for longer periods when one curve stays consistently higher than the others.
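To be concrete about the two architectures, they look roughly like this (a sketch: the layer widths other than the final 3-unit layers are placeholders, only the overall structure matches what I described above):

    from tensorflow.keras import layers, models

    def build_q_net(extra_linear_layer, window=96, n_features=80, n_actions=3):
        inp = layers.Input(shape=(window, n_features))
        x = layers.Dense(64, activation="elu")(inp)
        x = layers.Dense(64, activation="elu")(x)
        x = layers.LSTM(64, return_sequences=True)(x)
        x = layers.LSTM(64)(x)                    # only the last step (tried both ways)
        if extra_linear_layer:
            # fig. 2 variant: extra 3-unit layer (linear or tanh) before the output
            x = layers.Dense(n_actions, activation="linear")(x)
        out = layers.Dense(n_actions, activation="linear")(x)   # Q(s, buy/wait/sell)
        return models.Model(inp, out)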
Ideally, I would like to find a way to tune between these two extremes. However, the second behaviour appears only when I add a second dense layer of 3 units before the last one (it can be linear again, or tanh), so I cannot get any of the possibilities in between. Tuning other parameters, such as the discount factor, the learning rate, or the LSTM bias, does not really help (increasing the LSTM bias to 5 does help, but only in the first iterations; then it goes back to the same behaviour). Using GRU instead of LSTM does not significantly change the dynamics of the Q values either.
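For reference, by "increasing the LSTM bias to 5" I mean the bias initialization of the recurrent layers, set roughly like this (a sketch, assuming Keras' standard bias_initializer argument; the unit count is a placeholder):

    from tensorflow.keras import layers, initializers

    # bias of the LSTM gates initialized to a constant 5 instead of the default
    lstm = layers.LSTM(64, return_sequences=True,
                       bias_initializer=initializers.Constant(5.0))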
I am at a dead end. Any suggestions? I also do not understand why adding a simple final linear layer slows down the estimation of the Q values so much.
EDIT: After enough iterations, even the second case (2 linear layers) slowly converges to the behaviour where the Q values are far too volatile, so the desired behaviour only lasts a few tens of thousands of steps.