I'm trying to build a scikit-learn DecisionTreeRegressor, following the steps listed here to create a very simple decision tree.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

# create a regressor object
regressor = DecisionTreeRegressor(random_state=0)

# fit the regressor with X and y data
regressor.fit(X_train, y_train)

The model works fine when predicting for inputs that fall inside the X_train range:

y_pred = regressor.predict([[700]])
print(y_pred)
>[43.]

However, for inputs larger than the largest value in X_train, the model always predicts the same value, which here is the max of y_train.

X_test = np.array([[4000], [10000]])
y_pred = regressor.predict(X_test)
print(y_pred)
>[55. 55.]

How could the regression be extended so that, for X_test inputs beyond the X_train range, it predicts values higher than the ones in y_train, following the trend it finds over the X_train interval?

  • When you say 'extend', do you mean retrain / refit the model as new data comes in? – Capybara Jun 03 '22 at 19:28
  • And this might help answer your question regarding creating an interval by using weights to more heavily weight recent observations: https://stackoverflow.com/questions/56029577/how-to-weigh-data-points-with-sklearn-training-algorithms – Capybara Jun 03 '22 at 19:31

1 Answer

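Classical decision tree algorithms can't extrapolate beyond the range of the training data. A regression tree predicts one constant value per leaf, so every input larger than the last split threshold lands in the same leaf and gets the same prediction. To see why, you can plot your decision tree and follow its decision path.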


Imports

import numpy as np
from sklearn import tree
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

Tree model


X_train = np.array([[100],[500],[1500],[3500]])
y_train = np.array([23, 43, 44, 55])

# create a regressor object
regressor = DecisionTreeRegressor(random_state = 0) 
  
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)

Visualized model

fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(regressor, 
                   filled=True)
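If you'd rather check this without a plot, here is a minimal sketch that prints the split rules as text and asks the tree which leaf each input ends up in. It uses `export_text` and `apply` from scikit-learn; the feature name `"x"` and the variable names `rules`, `probe`, `leaf_ids` and `preds` are just labels I picked for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

regressor = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Text dump of the split thresholds and the constant value stored in each leaf.
# "x" is only a display label for the single feature.
rules = export_text(regressor, feature_names=["x"])
print(rules)

# The largest training input plus two inputs beyond the training range.
probe = np.array([[3500], [4000], [10000]])

# apply() returns the index of the leaf that each sample falls into.
leaf_ids = regressor.apply(probe)
print(leaf_ids)

# All three inputs share one leaf, so they share one prediction.
preds = regressor.predict(probe)
print(preds)
```

Every threshold in the tree sits between two training inputs, so anything above the last threshold follows the same path as 3500 and returns that leaf's stored value.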


Linear model



X_train = np.array([[100],[500],[1500],[3500]])
y_train = np.array([23, 43, 44, 55])

reg = LinearRegression().fit(X_train, y_train)

x_outside_range = np.array([[4000], [10000]])

plt.plot(X_train, y_train, label='train data')
plt.plot(x_outside_range, reg.predict(x_outside_range), label='prediction outside train data range')
plt.legend()
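If you want to keep a tree in the model but still follow the trend outside the training range, one possible approach (my suggestion, not something scikit-learn does for you automatically) is to fit the linear trend first and then fit the tree on what the line leaves unexplained. A rough sketch, where `trend`, `residual_tree` and `predict_with_trend` are hypothetical names of my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[100], [500], [1500], [3500]])
y_train = np.array([23, 43, 44, 55])

# Step 1: a linear model captures the overall trend and can extrapolate.
trend = LinearRegression().fit(X_train, y_train)

# Step 2: a tree models the residuals, i.e. what the line gets wrong.
residuals = y_train - trend.predict(X_train)
residual_tree = DecisionTreeRegressor(random_state=0).fit(X_train, residuals)


def predict_with_trend(X):
    """Hypothetical helper: linear trend plus the tree's residual correction."""
    return trend.predict(X) + residual_tree.predict(X)


X_test = np.array([[4000], [10000]])
hybrid_pred = predict_with_trend(X_test)
print(hybrid_pred)
```

Outside the training range the tree part is still constant, so all of the growth comes from the linear term. Whether a straight line is the right trend for your data is an assumption you'd have to check yourself, especially with only four points.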

Yev Guyduy