Sklearn DecisionTreeRegressor - Extend prediction

Question

Trying to build a sklearn DecisionTreeRegressor, I'm following the steps listed here to create a very simple decision tree.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

# create a regressor object
regressor = DecisionTreeRegressor(random_state=0)

# fit the regressor with X and Y data
regressor.fit(X_train, y_train)

The model works fine when predicting values that would be in the X_train interval:

y_pred = regressor.predict([[700]])
print(y_pred)
>[43.]

However, for values above the range covered by X_train, the model only predicts the maximum value of y_train.

X_test = np.array([[4000], [10000]])
y_pred = regressor.predict(X_test)
print(y_pred)
>[55. 55.]

How can the regression be extended so that, for X_test values above the range covered by X_train, the predictions follow the trend found over the X_train interval instead of stopping at the maximum of y_train?

When you say 'extend', do you mean retrain / refit the model as new data comes in? - Capybara, Jun 03 '22 at 19:28
And this might help answer your question regarding creating an interval by using weights to more heavily weight recent observations: https://stackoverflow.com/questions/56029577/how-to-weigh-data-points-with-sklearn-training-algorithms - Capybara, Jun 03 '22 at 19:31

score 0 · Accepted Answer · answered Jun 07 '22 at 19:59

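Classical decision tree algorithms can't extrapolate beyond the data they have seen. To understand why, plot the fitted tree and follow its decision path: every input ends in a leaf that stores a constant value, so any input above the largest split threshold receives the same prediction.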

Imports

import numpy as np
from sklearn import tree
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

Tree model


X_train = np.array([[100],[500],[1500],[3500]])
y_train = np.array([23, 43, 44, 55])

# create a regressor object
regressor = DecisionTreeRegressor(random_state=0)

# fit the regressor with X and Y data
regressor.fit(X_train, y_train)

Visualized model

fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(regressor, filled=True)
plt.show()
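
The same point can be checked in text, without a plotting backend. The sketch below assumes the training data from the question; it prints the fitted rules with export_text and uses apply to show that inputs above the last split all land in the same leaf, which is why the prediction stays constant. The names beyond_range, leaf_ids and tree_preds are only illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

regressor = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Text view of the fitted rules: each leaf stores one constant value.
print(export_text(regressor, feature_names=["x"]))

# Inputs at and beyond the largest training value (illustrative choice).
beyond_range = np.array([[3500], [4000], [10000]])

# apply() returns the index of the leaf each sample ends up in.
leaf_ids = regressor.apply(beyond_range)
tree_preds = regressor.predict(beyond_range)

print(leaf_ids)    # the same leaf index for all three inputs
print(tree_preds)  # hence the same constant prediction
```

Because the last leaf covers everything to the right of the final threshold, no amount of distance from the training data changes the output.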

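Linear model

A linear model, by contrast, fits a slope and intercept, so its predictions keep following the fitted trend outside the training range.

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

# fit a straight line to the training data
reg = LinearRegression().fit(X_train, y_train)

# inputs beyond the largest training value
x_outside_range = np.array([[4000], [10000]])

plt.plot(X_train, y_train, label='train data')
plt.plot(x_outside_range, reg.predict(x_outside_range), label='prediction outside train data range')
plt.legend()
plt.show()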
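
To make the difference concrete without a plot, the two models can be compared numerically on the same out-of-range inputs. This is a minimal sketch assuming the data from the question; tree_model, linear_model, tree_out and linear_out are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])
x_outside_range = np.array([[4000], [10000]])

tree_model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
linear_model = LinearRegression().fit(X_train, y_train)

tree_out = tree_model.predict(x_outside_range)
linear_out = linear_model.predict(x_outside_range)

# The tree repeats its last leaf value; the line keeps rising with x.
for x, t, l in zip(x_outside_range.ravel(), tree_out, linear_out):
    print(f"x={x}: tree={t:.1f}, linear={l:.1f}")
```

Whether a straight line is a reasonable trend for four points is a separate modelling question; the sketch only shows that the linear fit extrapolates while the tree does not.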
