The following has been done in MATLAB.
I am trying to build a trading algorithm using Deep Q-learning. I have taken a year's worth of daily stock prices and am using that as the training set.
My state space is [money, stock, price], where money is the amount of cash I have, stock is the number of shares I hold, and price is the price of the stock at that time step.
The issue I am having is with the actions: looking online, people only seem to use three actions, { buy | sell | hold }.
My reward function is the difference between the portfolio value at the current time step and the portfolio value at the previous time step.
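To be concrete, this is a toy example of what I mean by the reward (made-up numbers and placeholder variable names, not from my code):

% toy example of the reward: change in portfolio value between two steps
money_prev = 1000;  stock_prev = 5;   price_prev = 10;   % holdings at step t-1
money_now  = 950;   stock_now  = 10;  price_now  = 11;   % after buying 5 shares at 10, price moves to 11
reward = ( money_now  + stock_now  * price_now ) ...
       - ( money_prev + stock_prev * price_prev );       % = 1060 - 1050 = 10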
But using just three actions, I am unsure how the agent would choose to buy, let's say, 67 shares at the current price.
I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell between 0 and 100 shares, hold (0), or buy between 1 and 100 shares.
Can anyone shed some light on how I can reduce this to 3 actions?
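For context, this is the kind of { buy | sell | hold } scheme I keep seeing in examples online; they all seem to assume a fixed number of shares per trade (sketch only; trade_size and a here are placeholders, not part of my code):

% sketch of the 3-action scheme from online examples (not my code)
trade_size = 1;              % fixed number of shares per buy/sell
a = randi( 3 );              % 1 = buy, 2 = sell, 3 = hold
if a == 1
    shares = +trade_size;    % buy trade_size shares
elseif a == 2
    shares = -trade_size;    % sell trade_size shares
else
    shares = 0;              % hold
end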
My code is:
% p is the stock price
% sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidden_layers = 1;        % number of neurons in the single hidden layer
actions = 202;            % one output per possible trade size (see action mapping below)
net = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...  % input ranges for [money, stock, price]
             [hidden_layers, actions], ...                       % layer sizes
             {'tansig','purelin'}, ...                           % transfer functions
             'trainlm' );                                        % training function
net = init( net );
net.trainParam.showWindow = false;
% neural network training parameters -----------------------------------
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.1;
net.trainParam.epochs = 100;
% parameters for q learning --------------------------------------------
epsilon = 0.8;
gamma = 0.95;
max_episodes = 1000;
max_iterations = length( p ) - 1;
reset = false;
initial_money = 1000;
initial_stock = 0;
% these are where I will save the outputs
save_s = zeros( max_iterations, max_episodes );
save_pt = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a = zeros( max_iterations, max_episodes );
% construct the initial state -------------------------------------------
% a = randi( [1 3], 1, 1 );
s = [initial_money; initial_stock; p( 1, 1 )];
% construct initial q matrix -------------------------------------------
Qs = zeros( 1, actions );
Qs_prime = zeros( 1, actions );
for i = 1:max_episodes
for j = 1:max_iterations % max_iterations --------------
Qs = net( s );                       % current Q-value estimates for state s
%% here we will choose an action based on epsilon-greedy strategy
if ( rand() <= epsilon )             % explore: pick a random action
a = randi( [1 actions], 1, 1 );
else                                 % exploit: pick the greedy action
[Qs_value, a] = max( Qs );
end
a2 = a - 101;                        % map action index 1..202 to a trade of -100..+101 shares
save_a( j, i ) = a2;
sp = p( j+1, 1 );                    % price at the next time step
pt = s( 1 ) + s( 2 ) * p( j, 1 );    % current portfolio value
save_pt( j, i ) = pt;
[s_prime, reward] = simulateStock( s, a2, pt, sp );
Qs_prime = net( s_prime );           % Q-value estimates for the next state
Q_target = reward + gamma * max( Qs_prime );
save_Q_target( j, i ) = Q_target;
Targets = Qs;
Targets( a ) = Q_target;             % only the chosen action gets the updated target
net = train( net, s, Targets );      % fit the network towards the target Q-values
save_s( j, i ) = s( 1 );
s = s_prime;
end
epsilon = epsilon * 0.99 ;
reset = false;
s = [initial_money; initial_stock; p( 1, 1 )];
end
% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
money = s( 1 );
stock = s( 2 );
price = s( 3 );
% clamp the trade so we cannot spend money we do not have
% or sell more shares than we hold
a = max( min( a, floor( money / price ) ), -stock );
money = money - a * price;
stock = stock + a;
s_prime = [money; stock; sp];
% portfolio value after the price moves to sp, minus the previous value pt
reward = ( money + stock * sp ) - pt;
end