I have programmed a reinforcement learning model with a DQN approach that is supposed to make purchase decisions based on stock prices.
For training I use two stock price series: one with an upward trend and one with a downward trend. Both cover a period of 1 year (100,000 data points each).
As the observation I use the price data of the last 1,000 data points.
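For clarity, the observation construction looks roughly like this (a simplified sketch with toy data; names and preprocessing are illustrative, not my actual code):

```python
import numpy as np

np.random.seed(0)
WINDOW = 1_000  # observation = last 1,000 price points

def make_observation(prices: np.ndarray, t: int) -> np.ndarray:
    """Slice the window of the last WINDOW prices ending at step t
    (simplified; the real preprocessing may differ)."""
    return prices[t - WINDOW:t]

# Toy random-walk stand-in for one of the 100,000-point series.
prices = np.cumsum(np.random.randn(100_000)) + 500.0
obs = make_observation(prices, t=5_000)
```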
For training I first collect 100 episodes (one episode is one run through a complete price series, where the series (upward or downward trend) is chosen at random). Per episode I get about 1,000 actions (buy, sell, skip).
Training then takes place with a batch size of 64.
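Simplified, my collect-then-train loop looks roughly like this (toy series, random placeholder actions, and an assumed replay-buffer capacity instead of my actual environment):

```python
import random
from collections import deque

random.seed(0)
BATCH_SIZE = 64
EPISODES_PER_ROUND = 100

# Two toy series standing in for the real up-trend / down-trend data.
up_series = [100 + 0.1 * t for t in range(1_000)]
down_series = [100 - 0.1 * t for t in range(1_000)]

replay_buffer = deque(maxlen=500_000)  # capacity is an assumption

def run_episode(series):
    """One full pass over a series; actions here are random placeholders."""
    for t in range(1, len(series)):
        action = random.choice(("buy", "sell", "skip"))
        reward = 0.0  # the real reward would come from the trading logic
        replay_buffer.append((series[t - 1], action, reward, series[t]))

for _ in range(EPISODES_PER_ROUND):
    # The trend (up or down) is chosen at random per episode.
    run_episode(random.choice((up_series, down_series)))

# Training step: sample a mixed batch of transitions from the buffer.
batch = random.sample(replay_buffer, BATCH_SIZE)
```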
The problem is that the model specializes in one of the two price series and achieves a good reward there, while on the other series it performs very badly and I get a negative reward.
It seems the model does not try to optimize the average profit across all episodes (upward and downward trend).
As the reward I simply take the money I make or lose per trade. I have set the discount factor to 1.0.
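To make the reward setup concrete, here is a minimal sketch of how the per-trade reward and the discount factor of 1.0 enter the standard DQN bootstrap target (function names are mine, not from my actual code):

```python
import numpy as np

GAMMA = 1.0  # the discount factor described above; DQN setups often use e.g. 0.99

def td_target(reward: float, next_q: np.ndarray, done: bool) -> float:
    """Standard DQN target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward  # no bootstrap past the end of an episode
    return reward + GAMMA * float(np.max(next_q))

# Example: a trade that made 5.0 profit, with estimated next-state Q-values.
target = td_target(5.0, np.array([1.0, -0.5, 2.0]), done=False)
```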
Does anyone have an idea what the problem could be?