
I'm trying to build a classifier to predict stock prices. I generated extra features using some well-known technical indicators and feed these values, along with lagged values from past time points, to the machine learning algorithm. I have about 45k samples, each representing an hour of OHLCV data.

The problem is actually a 3-class classification problem, with buy, sell and hold signals. I've built these three classes as my targets based on the percentage change at each time point: I classified only the largest positive changes as buy signals, the largest negative changes as sell signals, and the rest as hold signals.

However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy and sell classes. To improve this, I chose to assign classes manually based on the predicted probability of each sample. That is, I set the targets to 1 or 0 based on whether there was a price increase or decrease. The algorithm then returns a probability between 0 and 1 (usually between 0.45 and 0.55) for its confidence in which class each sample belongs to. I then select a probability bound for each class within that range. For example: I classify p > 0.53 as a buy signal, p < 0.48 as a sell signal, and anything in between as a hold signal.
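The thresholding step described above amounts to something like the following sketch. The 0.53/0.48 bounds are the ones from the question; the function name and label strings are illustrative, not from any actual codebase:

```python
import numpy as np

def signals_from_probs(probs, buy_thr=0.53, sell_thr=0.48):
    """Map predicted up-move probabilities to buy/hold/sell signals.

    probs: array of P(price increase) from a binary classifier.
    Returns an array of 'buy', 'sell', or 'hold' labels.
    """
    probs = np.asarray(probs)
    out = np.full(probs.shape, "hold", dtype=object)
    out[probs > buy_thr] = "buy"   # confident up-move -> buy
    out[probs < sell_thr] = "sell"  # confident down-move -> sell
    return out

print(signals_from_probs([0.55, 0.50, 0.44]))  # ['buy' 'hold' 'sell']
```

Anything falling strictly between the two bounds stays a hold, which is what creates the "no trade" band around 0.5.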

This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method for selecting these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 samples, and this improved the classification accuracy, yet as the validation set grows larger, the prediction accuracy on the test set clearly decreases.

So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!

C.Acarbay

1 Answer


What you are experiencing is a non-stationary process: the distribution of market movements changes over time.

One way I've dealt with this is to build the model on data from rolling time chunks.

For example, use data from day 1 to day 10 for training, and day 11 for testing/validation, then move up one day, day 2 to day 11 for training, and day 12 for testing/validation.

You can pool all your test results to compute an overall score for your model. This way you have plenty of test data, and a model that adapts over time.
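The rolling scheme described above can be sketched as a generator of train/test index pairs. The window sizes here (240 hours of training, 24 of testing, stepping forward one day) are just an illustration mapped onto hourly data, not values the answerer prescribed:

```python
import numpy as np

def walk_forward_splits(n_samples, train_size, test_size, step):
    """Yield (train_idx, test_idx) pairs for a rolling walk-forward scheme.

    Each split trains on a contiguous window and tests on the window
    immediately after it, then slides both windows forward by `step`.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

# e.g. 10 days (240 h) of training, 1 day (24 h) of testing, moving up 1 day
splits = list(walk_forward_splits(1000, 240, 24, 24))
```

Fitting the classifier once per split and concatenating all the test-window predictions gives the pooled out-of-sample results mentioned above.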

And you get three more parameters to tune: (1) how much data to use for training, (2) how much data to use for testing, and (3) how often (in days/hours/data points) to retrain your model.

Aiden Zhao
  • Any tips on how to select decision boundaries? I feel that if I select them based on the overall data, I'd be missing something; i.e., in a major down-trend, looking for 0.47 confidence might be better even if overall the model works best at 0.49. Is there any way to adjust these boundaries as more data comes in, to make the model more suitable for newer data? – C.Acarbay Jan 29 '19 at 13:22
  • After you have all your predictions for the test data, use them in a simulation, and pick your decision boundaries based on the simulation. It is still an overall number; based on my model, my boundaries are roughly buy if p > 0.65, sell if p < 0.45, else hold. The buy and sell boundaries don't have to be symmetric around 0.5 – Aiden Zhao Jan 29 '19 at 21:02
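The simulation-based boundary search suggested in the comment could be sketched as a grid search over the pooled out-of-sample predictions. Everything here is illustrative: the scoring rule is a naive sum of long/short returns, not a real backtest with costs, and the inputs are hypothetical, not the poster's data:

```python
import numpy as np

def pick_thresholds(probs, returns, grid=None):
    """Grid-search buy/sell probability bounds on out-of-sample predictions.

    probs:   predicted P(up) per period.
    returns: realised % change per period, aligned with probs.
    Scores each (buy_thr, sell_thr) pair by summing returns of long
    trades minus returns of short trades; returns the best triple
    (buy_thr, sell_thr, score).
    """
    probs, returns = np.asarray(probs), np.asarray(returns)
    if grid is None:
        grid = np.round(np.arange(0.40, 0.61, 0.01), 2)
    best = (None, None, -np.inf)
    for buy_thr in grid:
        for sell_thr in grid:
            if sell_thr >= buy_thr:
                continue  # keep a hold band between the two bounds
            pnl = returns[probs > buy_thr].sum() - returns[probs < sell_thr].sum()
            if pnl > best[2]:
                best = (buy_thr, sell_thr, pnl)
    return best
```

To address the follow-up question about drifting regimes, the same search could be rerun on only the most recent test windows each time the model is retrained, so the bounds adapt along with the model.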