Fitting Ensemble Regressor within a loop generates repeat values

Question

I'm trying to use an ensemble regressor to predict production based on a couple of material measurements. My data is annual, going back to 1965. (Some details stripped out and random data used because this is for a work project using sensitive data.)

I've stripped my code down to the bare minimum and I'm still seeing the issue:

import pandas as pd
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost.sklearn import XGBRegressor

X_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    A = np.random.randint(4170, 19091, size = 55),
    B = np.random.randint(74, 337, size = 55)
))

X_future = pd.DataFrame(index = range(2020, 2023), data = dict(
    A = np.random.randint(4170, 19091, size = 3),
    B = np.random.randint(74, 337, size = 3)
))

y_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    C = np.random.randint(12163, 42580, size = 55)
))

predictions = []  # one array of predictions per fit

i = 0

while i < 10:
    i += 1
    
    reg = None
    y_pred = None
    
    X = X_past.values
    y = y_past.values.ravel()

    #reg = RandomForestRegressor(n_estimators = 300)
    reg = GradientBoostingRegressor(n_estimators = 300)
    #reg = XGBRegressor(n_estimators = 640, silent = True)

    reg.fit(X, y)

    y_pred = reg.predict(X_future.values)
    predictions.append(y_pred)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
predictions = pd.DataFrame(predictions, columns = [2020, 2021, 2022])
predictions['Row-wise Duplicates'] = (predictions[2021] == predictions[2022])

predictions

That produces results such as:

2020	2021	2022	Row-wise Duplicates
13211.008045	29624.483861	34110.523735	False
13211.008045	29624.483861	33462.196606	False
13211.008045	29624.483861	33867.781932	False
13211.008045	29624.483861	33999.203849	False
13211.008045	29624.483861	33947.950436	False
13211.008045	29624.483861	33550.338744	False
13211.008045	29624.483861	34079.297200	False
13211.008045	29624.483861	33924.349324	False
13211.008045	29624.483861	33195.847833	False
13211.008045	29624.483861	33922.391200	False

As you can see, despite fitting a fresh model on each iteration, I'm seeing a lot of repeated values.

I also sometimes see duplication of values across the years (usually 2021 matching 2022, which is why I calculate the Row-wise Duplicates column):

2020	2021	2022	Row-wise Duplicates
40819.929316	40819.929316	40819.929316	True
41516.312213	41516.312213	41516.312213	True
41516.312213	41516.312213	41516.312213	True
40901.743937	40901.743937	40901.743937	True
41191.025907	41191.025907	41191.025907	True
41109.211286	41109.211286	41109.211286	True
40910.834451	40910.834451	40910.834451	True
41799.581630	41799.581630	41799.581630	True
42512.531092	42512.531092	42512.531092	True
41018.306151	41018.306151	41018.306151	True

What am I doing wrong? Why am I seeing duplicates like this? And how can I fix it?

You should at least mention where you import `GradientBoostingRegressor` from (I guess scikit-learn but do I have to guess? :) ) — Itamar Katz, Dec 09 '20 at 17:36
Oh, whoops... Yeah, sklearn. XGBRegressor comes from xgboost. — Bob, Dec 09 '20 at 17:40
You give the algorithm the same training input and the same test input in each iteration, why do you expect a different output? — Itamar Katz, Dec 09 '20 at 17:50

score 1 · Accepted Answer · answered Dec 09 '20 at 18:02

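With the parameters you use (in particular the default subsample = 1.0), the algorithm has essentially no internal randomness: every base learner is trained on the full training set. So giving it the same training set and the same test set on every iteration, as your code does, will produce the same results.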

You can set the subsample parameter to a value smaller than 1 so that each base learner is trained on a different random sub-sample of the data (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).

So, if you replace your line with this one:

reg = GradientBoostingRegressor(n_estimators = 300, subsample = 0.9)

The algorithm will use a random subset of 90% of your data to train each learner, and you will get different results in each call. You can still make the results reproducible if you combine it with the random_state parameter.
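As a minimal sketch of the point about subsample and random_state, the snippet below fits the same model three times on fixed toy data. The data, the helper name fit_predict, and the seed values are assumptions made for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data with a fixed seed, so the inputs are identical on every run (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 1000, size=(55, 2)).astype(float)
y_train = rng.integers(0, 1000, size=55).astype(float)
X_new = rng.integers(0, 1000, size=(3, 2)).astype(float)

def fit_predict(seed):
    """Hypothetical helper: fit a fresh model with row subsampling and predict X_new."""
    reg = GradientBoostingRegressor(n_estimators=300, subsample=0.9, random_state=seed)
    return reg.fit(X_train, y_train).predict(X_new)

same_a = fit_predict(seed=0)
same_b = fit_predict(seed=0)  # same seed: same sub-samples, same predictions
other = fit_predict(seed=1)   # different seed: different sub-samples

print(np.allclose(same_a, same_b))
print(np.allclose(same_a, other))
```

Leaving random_state unset lets every fit draw its own sub-samples, which is what makes repeated calls differ. The scikit-learn documentation also notes that features are permuted at each split, so ties between equally good splits may be resolved differently from fit to fit unless random_state is fixed.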

Itamar Katz


This makes sense, but then why is it sometimes different from row to row? Also, why does it sometimes duplicate results for the 2021 and 2022 columns? Those should be predicted off of three different feature arrays. – Bob Dec 09 '20 at 18:10
I don't know, with the input from your question it produces identical rows. But I only ran it a few times. – Itamar Katz Dec 09 '20 at 18:12
Are you sure the different rows don't come from running the script several times (as opposed to differing between iterations i and j within one run)? That would be caused by the random input you create... sorry if it's trivial, but sometimes the simple things are hard to see :) – Itamar Katz Dec 09 '20 at 18:14
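
One way to rule out the random input as the source of run-to-run differences is to seed the generator that builds the test data. This is only a sketch: make_features is a hypothetical helper, and the seed values are arbitrary; the column ranges mirror the ones in the question.

```python
import numpy as np
import pandas as pd

def make_features(seed, index):
    """Hypothetical helper: random A/B columns as in the question, from a seeded generator."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame(index=index, data=dict(
        A=rng.randint(4170, 19091, size=len(index)),
        B=rng.randint(74, 337, size=len(index)),
    ))

run_1 = make_features(seed=0, index=range(2020, 2023))
run_2 = make_features(seed=0, index=range(2020, 2023))
run_3 = make_features(seed=1, index=range(2020, 2023))

print(run_1.equals(run_2))  # same seed, so the two frames hold the same values
print(run_1.equals(run_3))  # a different seed draws different values
```

With the inputs pinned this way, any remaining variation between runs has to come from the model itself rather than from the data.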
