I'm trying to use an ensemble regressor to predict production based on a couple of material measurements. My data is annual, going back to 1965. (Some details stripped out and random data used because this is for a work project using sensitive data.)
I've stripped my code down to the bare minimum and I'm still seeing the issue:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from xgboost.sklearn import XGBRegressor

    # Random stand-in data: two features (A, B) and one target (C), annual from 1965
    X_past = pd.DataFrame(index = range(1965, 2020), data = dict(
        A = np.random.randint(4170, 19091, size = 55),
        B = np.random.randint(74, 337, size = 55)
    ))
    X_future = pd.DataFrame(index = range(2020, 2023), data = dict(
        A = np.random.randint(4170, 19091, size = 3),
        B = np.random.randint(74, 337, size = 3)
    ))
    y_past = pd.DataFrame(index = range(1965, 2020), data = dict(
        C = np.random.randint(12163, 42580, size = 55)
    ))

    rows = []  # one array of predictions per run
    for i in range(10):
        # Reset everything so nothing carries over between iterations
        reg = None
        y_pred = None
        X = X_past.values
        y = y_past.values.ravel()
        #reg = RandomForestRegressor(n_estimators = 300)
        reg = GradientBoostingRegressor(n_estimators = 300)
        #reg = XGBRegressor(n_estimators = 640, verbosity = 0)
        reg.fit(X, y)
        y_pred = reg.predict(np.array(X_future))
        rows.append(y_pred)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    predictions = pd.DataFrame(rows, columns = [2020, 2021, 2022])
    predictions['Row-wise Duplicates'] = (predictions[2021] == predictions[2022])
    predictions

That produces results such as:
| 2020 | 2021 | 2022 | Row-wise Duplicates |
|---|---|---|---|
| 13211.008045 | 29624.483861 | 34110.523735 | False |
| 13211.008045 | 29624.483861 | 33462.196606 | False |
| 13211.008045 | 29624.483861 | 33867.781932 | False |
| 13211.008045 | 29624.483861 | 33999.203849 | False |
| 13211.008045 | 29624.483861 | 33947.950436 | False |
| 13211.008045 | 29624.483861 | 33550.338744 | False |
| 13211.008045 | 29624.483861 | 34079.297200 | False |
| 13211.008045 | 29624.483861 | 33924.349324 | False |
| 13211.008045 | 29624.483861 | 33195.847833 | False |
| 13211.008045 | 29624.483861 | 33922.391200 | False |
As you can see, despite fitting a fresh model on each iteration, the 2020 and 2021 predictions are identical across all ten runs; only the 2022 value changes from run to run.
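To put a number on the repeats, something like the following counts the distinct predicted values in each year column. This is only a sketch: `count_distinct_per_year` is a helper name made up for illustration, and the small frame reuses three rows from the table above instead of rerunning the models.

```python
import pandas as pd

def count_distinct_per_year(predictions: pd.DataFrame, year_columns) -> pd.Series:
    """Return the number of distinct predicted values in each year column."""
    return predictions[list(year_columns)].nunique()

# Three rows copied from the table above (a stand-in for the full ten runs)
sample = pd.DataFrame({
    2020: [13211.008045, 13211.008045, 13211.008045],
    2021: [29624.483861, 29624.483861, 29624.483861],
    2022: [34110.523735, 33462.196606, 33867.781932],
})

distinct = count_distinct_per_year(sample, [2020, 2021, 2022])
print(distinct)
```

A count of 1 for a year means every run produced exactly the same prediction for that year.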
I also sometimes see duplication of values across the years (usually 2021 matching 2022, which is why I calculate the Row-wise Duplicates column):
| 2020 | 2021 | 2022 | Row-wise Duplicates |
|---|---|---|---|
| 40819.929316 | 40819.929316 | 40819.929316 | True |
| 41516.312213 | 41516.312213 | 41516.312213 | True |
| 41516.312213 | 41516.312213 | 41516.312213 | True |
| 40901.743937 | 40901.743937 | 40901.743937 | True |
| 41191.025907 | 41191.025907 | 41191.025907 | True |
| 41109.211286 | 41109.211286 | 41109.211286 | True |
| 40910.834451 | 40910.834451 | 40910.834451 | True |
| 41799.581630 | 41799.581630 | 41799.581630 | True |
| 42512.531092 | 42512.531092 | 42512.531092 | True |
| 41018.306151 | 41018.306151 | 41018.306151 | True |
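The Row-wise Duplicates column only compares 2021 with 2022, so it would miss a row where 2020 also matches. A slightly broader check could look like the sketch below, which flags rows where all year columns hold the same value. `all_years_equal` is a hypothetical helper name, and the two sample rows are copied from the tables in this post, not freshly generated.

```python
import pandas as pd

def all_years_equal(predictions: pd.DataFrame, year_columns) -> pd.Series:
    """True for each row where every year column holds exactly the same value."""
    return predictions[list(year_columns)].nunique(axis=1) == 1

# One row from the second table (all equal) and one from the first (all different)
sample = pd.DataFrame({
    2020: [40819.929316, 13211.008045],
    2021: [40819.929316, 29624.483861],
    2022: [40819.929316, 34110.523735],
})

flags = all_years_equal(sample, [2020, 2021, 2022])
print(flags)
```

Like the original column, this relies on exact floating-point equality, which is fine here because the repeated predictions are bit-for-bit identical.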
What am I doing wrong? Why am I seeing duplicates like this? And how can I fix it?