I'm training a deep Q-network to trade stocks. It has two possible actions: 0 = wait, 1 = buy a stock if none is held, or sell it if one is held. As input it gets the price it bought the stock at, the current price of the stock, and the prices of the stock over the previous 5 time steps, expressed relative to the current price. So the state looks something like
[5.78, 5.93, -0.1, -0.2, -0.4, -0.5, -0.3]
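To make that concrete, the state construction is roughly this (a simplified sketch, not my exact code; what the first slot holds when no stock is owned is just an illustrative choice here):

```python
import numpy as np

def build_state(prices, t, buy_price):
    """State at time t: [buy price, current price, previous 5 prices relative to current].
    buy_price when no position is held is an illustrative placeholder (e.g. 0)."""
    current = prices[t]
    history = prices[t - 5:t]       # prices over the previous 5 time steps
    relative = history - current    # expressed relative to the current price
    return np.concatenate(([buy_price, current], relative))
```

which gives a 7-element vector like the example above.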
The reward for a sale is simply the sale price minus the purchase price. The reward for every other action is 0, though I've tried making it negative (and various other schemes) without any improvement.
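In code, the reward logic amounts to something like this (again a simplified sketch; `holding` tracks whether a stock is currently owned):

```python
def reward(action, holding, buy_price, current_price):
    """Profit (or loss) is only realized at the moment of sale; everything else gets 0."""
    if action == 1 and holding:           # sell
        return current_price - buy_price
    return 0.0                            # wait, or buy: no immediate reward
```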
Simple, right? Unfortunately, the agent always converges on taking the "0" action, even when I magnify the reward for selling at a profit or try any number of other tweaks. I'm really pulling my hair out. Is there something obvious I've missed?