
I am trying to build a model that predicts shipping volume for each month, week, and day. I found that a decision-tree-based model works better than linear regression.

But I read some articles about machine learning that say decision-tree-based models can't predict values outside what the model has learned (extrapolation issues).

So I think this means that if a sample's date falls within the range covered by the training data, the model can predict well, but if the date is outside that range, it cannot.

I'd like to confirm whether my understanding is correct. Some posts show predictions on datetime-based data using a random forest model, which confuses me.

Also, please let me know if there is any way to overcome extrapolation issues in decision-tree-based models.
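To make the concern concrete, here is a minimal sketch on synthetic data (all names and values are made up for illustration): a fully grown decision tree trained on a linearly growing volume cannot follow the trend past the last training day, while linear regression can.

```python
# Sketch (assumed synthetic data): compare how a decision tree and
# linear regression extrapolate beyond the training date range.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Day index as the only feature; volume grows linearly with time.
days = np.arange(100).reshape(-1, 1)          # days 0..99 (training range)
volume = 10.0 * days.ravel() + 5.0            # synthetic shipping volume

tree = DecisionTreeRegressor().fit(days, volume)
lin = LinearRegression().fit(days, volume)

future = np.array([[150]])                    # a day outside the range
# The tree is stuck at the value of its last leaf (the training maximum),
# while the linear model continues the trend.
print(tree.predict(future), lin.predict(future))
```

The tree's prediction for day 150 equals the largest target seen in training, which is exactly the extrapolation limitation described in the articles.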

  • I'm not particularly familiar with decision trees specifically, but in general: it really depends what you're modeling. In order to know if your model can predict the future, you need to know the future. – Ryan M Apr 02 '20 at 06:24

2 Answers


It depends on the data. A decision tree predicts target values within the range [minimum target value in the training data, maximum target value in the training data]. For example, suppose there are five samples [(X1, Y1), (X2, Y2), ..., (X5, Y5)] and a well-trained tree has two leaf nodes. The first leaf N1 contains (X1, Y1) and (X2, Y2), and the other leaf N2 contains (X3, Y3), (X4, Y4), and (X5, Y5). The tree will then predict a new sample as the mean of Y1 and Y2 when the sample reaches N1, and as the mean of Y3, Y4, and Y5 when it reaches N2.

For this reason, if the target value of a new sample could be larger than the maximum target value in the training data, or smaller than the minimum, a decision tree is not recommended. Otherwise, tree-based models such as random forests show good performance.
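The bounded-range behavior described above is easy to verify empirically. A quick sketch (synthetic data, names assumed): even for an input far outside the training inputs, a random forest's prediction stays inside the training targets' [min, max] interval.

```python
# Demonstrate that tree-based predictions never leave the
# [min, max] range of the training targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train.ravel()              # targets span roughly [0, 30]

forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_new = np.array([[25.0]])                   # far outside the training inputs
pred = forest.predict(X_new)
# Despite the input being 2.5x the largest training input, the
# prediction is clipped to the training targets' range.
print(pred, y_train.min(), y_train.max())
```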

Gilseung Ahn

There can be different forms of extrapolation issue here. As already mentioned, a classical decision tree for classification can only predict values it encountered during training, so it will never predict a previously unseen value. This issue can be remedied by having the classifier predict relative updates instead of absolute values. But you need some understanding of your data to determine what works best in each case. Things are similar for a decision tree used for regression.
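One way to read "predict relative updates" is to train the tree on day-over-day changes instead of absolute volumes, then add the predicted change to the last known level. A minimal sketch on an assumed toy series (the autoregressive feature choice here is just one possibility):

```python
# Train on differences (relative updates) rather than absolute levels.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

volume = np.array([100., 110., 125., 135., 150., 160.])  # assumed series
diffs = np.diff(volume)                      # targets: changes, not levels

# Feature: the previous change; target: the next change
# (a simple one-lag autoregressive setup).
X = diffs[:-1].reshape(-1, 1)
y = diffs[1:]

tree = DecisionTreeRegressor().fit(X, y)

# Forecast the next change from the last observed change,
# then reconstruct the absolute level.
next_diff = tree.predict(np.array([[diffs[-1]]]))[0]
forecast = volume[-1] + next_diff
print(forecast)
```

Because the tree now models changes, the reconstructed forecast can land above the largest absolute volume it was trained on, which is the point of the trick.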

The next issue with "extrapolation" is that decision trees may perform badly if the statistics of your training data change over time. Again, I would propose predicting update relationships. Otherwise, predictions based on training data from the more recent past may be better. Since individual decision trees can't be trained in an online manner, you would have to create a new decision tree every x time steps.
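The "new tree every x time steps" idea can be sketched as a sliding-window retraining loop. Everything here (function name, window size, synthetic drifting data) is illustrative, not a prescribed implementation:

```python
# Refit a fresh tree on the most recent `window` samples
# every `retrain_every` steps, producing one-step-ahead forecasts.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rolling_forecast(X, y, window=100, retrain_every=10):
    """One-step-ahead forecasts, refitting on the last `window` samples."""
    preds, tree = [], None
    for t in range(window, len(X)):
        if tree is None or (t - window) % retrain_every == 0:
            tree = DecisionTreeRegressor().fit(X[t - window:t], y[t - window:t])
        preds.append(tree.predict(X[t:t + 1])[0])
    return np.array(preds)

# Usage on synthetic data whose statistics drift over time (assumed):
X = np.arange(300, dtype=float).reshape(-1, 1)
y = np.sin(X.ravel() / 20.0) + X.ravel() * 0.01
preds = rolling_forecast(X, y)
print(preds.shape)                           # one forecast per step after warm-up
```

The trade-off is retraining cost versus staleness: a smaller `retrain_every` tracks drift more closely but fits more trees.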

Going further than this, I'd say you'll want to start thinking in terms of state machines and try to use your classifier for state predictions. But this was a fairly uncharted domain of decision-tree theory when I last checked. It will work better if you already have some form of model for your data relationships in mind.