The following has been done in MATLAB.
I am trying to build a trading algorithm using Deep Q-learning. I have taken a year's worth of daily stock prices and am using that as the training set.
My state space is [money, stock, price], where money is the amount of cash I have, stock is the number of shares I hold, and price is the price of the stock at that time step.
The issue I am having is with the actions: looking online, people only seem to use three actions, { buy | sell | hold }.
My reward function is the difference between the portfolio value at the current time step and the portfolio value at the previous time step.
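To be concrete, this is a toy example of what I mean by the reward (made-up numbers and placeholder variable names, not from my code):

% toy example of the reward: change in portfolio value between two steps
money_prev = 1000;  stock_prev = 5;   price_prev = 10;   % holdings at step t-1
money_now  = 950;   stock_now  = 10;  price_now  = 11;   % after buying 5 shares at 10, price moves to 11
reward = ( money_now  + stock_now  * price_now ) ...
       - ( money_prev + stock_prev * price_prev );       % = 1060 - 1050 = 10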
But using just three actions, I am unsure how the agent would choose to buy, let's say, 67 shares at the current price.
I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell between 0 and 100 shares, hold (0), or buy between 1 and 100 shares.
Can anyone shed some light on how I can reduce this to 3 actions?
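For context, this is the kind of { buy | sell | hold } scheme I keep seeing in examples online; they all seem to assume a fixed number of shares per trade (sketch only; trade_size and a here are placeholders, not part of my code):

% sketch of the 3-action scheme from online examples (not my code)
trade_size = 1;              % fixed number of shares per buy/sell
a = randi( 3 );              % 1 = buy, 2 = sell, 3 = hold
if a == 1
    shares = +trade_size;    % buy trade_size shares
elseif a == 2
    shares = -trade_size;    % sell trade_size shares
else
    shares = 0;              % hold
end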
My code is:
% p is the stock price
% sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidden_layers = 1;        % number of neurons in the single hidden layer
actions = 202;            % one output per possible trade size (see action mapping below)
net = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...  % input ranges for [money, stock, price]
             [hidden_layers, actions], ...                       % layer sizes
             {'tansig','purelin'}, ...                           % transfer functions
             'trainlm' );                                        % training function
net = init( net );
net.trainParam.showWindow = false;
% neural network training parameters -----------------------------------
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.1;
net.trainParam.epochs = 100;
% parameters for q learning --------------------------------------------
epsilon = 0.8;
gamma = 0.95;
max_episodes = 1000;
max_iterations = length( p ) - 1;
reset = false;
initial_money = 1000;
initial_stock = 0;
% these are where I will save the outputs
save_s = zeros( max_iterations, max_episodes );
save_pt = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a = zeros( max_iterations, max_episodes );
% construct the initial state -------------------------------------------
% a = randi( [1 3], 1, 1 );
s = [initial_money; initial_stock; p( 1, 1 )];
% construct initial q matrix -------------------------------------------
Qs = zeros( 1, actions );
Qs_prime = zeros( 1, actions );
for i = 1:max_episodes
for j = 1:max_iterations % max_iterations --------------
Qs = net( s );                       % current Q-value estimates for state s
%% here we will choose an action based on epsilon-greedy strategy
if ( rand() <= epsilon )             % explore: pick a random action
a = randi( [1 actions], 1, 1 );
else                                 % exploit: pick the greedy action
[Qs_value, a] = max( Qs );
end
a2 = a - 101;                        % map action index 1..202 to a trade of -100..+101 shares
save_a( j, i ) = a2;
sp = p( j+1, 1 );                    % price at the next time step
pt = s( 1 ) + s( 2 ) * p( j, 1 );    % current portfolio value
save_pt( j, i ) = pt;
[s_prime, reward] = simulateStock( s, a2, pt, sp );
Qs_prime = net( s_prime );           % Q-value estimates for the next state
Q_target = reward + gamma * max( Qs_prime );
save_Q_target( j, i ) = Q_target;
Targets = Qs;
Targets( a ) = Q_target;             % only the chosen action gets the updated target
net = train( net, s, Targets );      % fit the network towards the target Q-values
save_s( j, i ) = s( 1 );
s = s_prime;
end
epsilon = epsilon * 0.99 ;
reset = false;
s = [initial_money; initial_stock; p( 1, 1 )];
end
% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
money = s( 1 );
stock = s( 2 );
price = s( 3 );
% clamp the trade so we cannot spend money we do not have
% or sell more shares than we hold
a = max( min( a, floor( money / price ) ), -stock );
money = money - a * price;
stock = stock + a;
s_prime = [money; stock; sp];
% portfolio value after the price moves to sp, minus the previous value pt
reward = ( money + stock * sp ) - pt;
end