
I am trying to map 13-dimensional input data to 3-dimensional output data using scikit-learn's RandomForestRegressor and GradientBoostingRegressor. While this works fine for the RandomForestRegressor, the GradientBoostingRegressor raises ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.

I don't really understand why I get this error with the GradientBoostingRegressor but not with the RandomForestRegressor. As far as I understand, both use decision trees as weak learners and combine them to get a good result. Of course I know that I could flatten the 3-dimensional output labels into a 1-dimensional array, but that does not make sense, as I want to map to a 3-dimensional output vector. Any idea how I can do this using the GradientBoostingRegressor?

Here is my code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Read data from CSV files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values


# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)

scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)


# Split dataset into train, validation, and test sets
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))

X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]

Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]


# Define a random forest model and train it
model_randomForest = RandomForestRegressor()
model_randomForest.fit(X_train, Y_train)

# Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")


# Define a gradient boosting model and train it (--> here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor()
model_gradientBoosting.fit(X_train, Y_train)

# Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")

Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link

Reminder: As I could not solve my problem yet, I would like to bump this question. Does anybody have an idea how to tackle it?

  • What is your cost function? Is it a multi-objective optimization setting? – Learning is a mess Jun 08 '22 at 10:03
  • @Learningisamess: Thanks "Learning is a mess" for your comment. Actually, I use the default cost function, which is the squared error (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html). So it should be the difference between predicted and actual output. – PeterBe Jun 08 '22 at 10:17
  • Mean squared error is meant for a 1-d target space. Do you want to use the Euclidean distance in 3d? – Learning is a mess Jun 08 '22 at 10:26
  • @Learningisamess: Thanks for your comment. Actually, I would like to use the mean squared error (with Euclidean distance) in 3d. But what I don't understand is why I don't get this error for the RandomForestRegressor. There I also use the default loss, which is likewise the squared error (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). – PeterBe Jun 08 '22 at 10:29
  • @Learningisamess: Thanks for your comments. Any thoughts on my last comment? I'd highly appreciate every further comment from you. – PeterBe Jun 09 '22 at 12:11
  • Honestly, I have never encountered a similar problem. My top-of-the-head idea would be to build three regressors, one for each coordinate, each minimizing the squared error, so that the sum of the errors matches the squared Euclidean distance (a sketch follows this comment thread). Sorry I cannot be of more help! – Learning is a mess Jun 09 '22 at 13:30
  • 1
  • @Learningisamess: Thanks for your comment and effort, I really appreciate it. The problem is quite strange. As pointed out, I don't understand at all why this problem does not occur when using RandomForest. But I understand that you can't help me any further. Hopefully someone else can answer this question. – PeterBe Jun 09 '22 at 14:57
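A minimal sketch of that per-coordinate idea, reusing the variables from the question above (the names per_coordinate_models and model_k are placeholders):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Fit one GradientBoostingRegressor per output column (each target is 1-d, so no ValueError)
per_coordinate_models = []
for k in range(Y_train.shape[1]):
    model_k = GradientBoostingRegressor()
    model_k.fit(X_train, Y_train[:, k])
    per_coordinate_models.append(model_k)

# Stack the per-coordinate predictions back into an (n_samples, 3) array
Y_pred_perCoordinate = np.column_stack([m.predict(X_test) for m in per_coordinate_models])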

1 Answer


RandomForestRegressor natively supports multi-output regression, see the docs. GradientBoostingRegressor does not, which is why it insists on a 1-d y.

You can wrap GradientBoostingRegressor in MultiOutputRegressor, which fits one independent regressor per target column. See this answer.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Example hyperparameters; MultiOutputRegressor fits one clone per target column
params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}

estimator = MultiOutputRegressor(GradientBoostingRegressor(**params))
estimator.fit(X_train, Y_train)
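The wrapper exposes the usual predict API, so the rest of the script carries over; a minimal sketch on the question's test split (inverse_transform is only needed to get predictions back on the original label scale):

# Predict the 3-dimensional test targets with the wrapped model
Y_pred_gradientBoosting = estimator.predict(X_test)  # shape (n_samples, 3)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")

# Optionally undo the label standardization applied in the question
Y_pred_original_scale = scaler_standardized_Y.inverse_transform(Y_pred_gradientBoosting)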