Is there a way to train a stacked regressor with scikit-learn such that a single final estimator is used to return multiple outputs?
I have been using sklearn.ensemble.StackingRegressor but, as indicated in the documentation of its .fit() method, it only accepts a target y of shape (n_samples,), so the final estimator can only return a single output, while I would need (n_samples, n_features).
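For instance, a minimal example along these lines (estimators and shapes are just placeholders) fails for me with a ValueError complaining that y should be a 1-d array:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # (n_samples, n_features)
Y = rng.normal(size=(100, 5))  # multivariate target, (n_samples, n_features)

stack = StackingRegressor(
    estimators=[("ridge", Ridge()), ("rf", RandomForestRegressor())],
    final_estimator=Ridge(),
)
stack.fit(X, Y)  # raises: y should be a 1d array
```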
As a workaround, scikit-learn provides sklearn.multioutput.MultiOutputRegressor (as proposed here: Multioutput Stacking Regressor), which extends univariate models into multivariate ones by training a separate model for each output feature. I got this solution to run; however, I am not satisfied with it, as it takes too long to train on high-dimensional data and needlessly multiplies the number of parameters.
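Concretely, the workaround looks like this (reusing stack, X and Y from the example above):

```python
from sklearn.multioutput import MultiOutputRegressor

# One complete StackingRegressor is cloned and fitted per output column,
# so the number of fitted models grows linearly with the number of outputs.
multi_stack = MultiOutputRegressor(stack)
multi_stack.fit(X, Y)            # X: (nsmpls, nfeats), Y: (nsmpls, nfeats)
Y_pred = multi_stack.predict(X)  # (nsmpls, nfeats)
```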
Instead, I would like to use a single final estimator (e.g. a random forest) that can take a multivariate input and return a multivariate output. This would make the prediction pipeline straightforward and, I believe, faster to train.
The following illustration represents what I would like to achieve:
Set of observations 1 --> Base Model 1 --+
    (nsmpls, nfeats)                     |
                                         +--> Concatenated predictions --> Final model --> Final prediction
                                         |    (nsmpls, 2 * nfeats)                         (nsmpls, nfeats)
Set of observations 2 --> Base Model 2 --+
    (nsmpls, nfeats)
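To make the diagram concrete, here is a rough hand-rolled sketch of what I am after (random data just to keep it self-contained; cross_val_predict stands in for the out-of-fold predictions that StackingRegressor would generate internally, and a random forest serves as the single multivariate final estimator):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
nsmpls, nfeats = 200, 4
X1 = rng.normal(size=(nsmpls, nfeats))  # observations for Base Model 1
X2 = rng.normal(size=(nsmpls, nfeats))  # observations for Base Model 2
Y = rng.normal(size=(nsmpls, nfeats))   # multivariate target

base1, base2 = Ridge(), RandomForestRegressor(n_estimators=50)

# Out-of-fold predictions to train the final model on, so the base models'
# in-sample fit does not leak into the final estimator.
P1 = cross_val_predict(base1, X1, Y, cv=5)  # (nsmpls, nfeats)
P2 = cross_val_predict(base2, X2, Y, cv=5)  # (nsmpls, nfeats)

# Refit the base models on the full data so they are usable on their own.
base1.fit(X1, Y)
base2.fit(X2, Y)

# A single multivariate final estimator on the concatenated predictions.
final = RandomForestRegressor(n_estimators=50)
final.fit(np.hstack([P1, P2]), Y)  # (nsmpls, 2 * nfeats) -> (nsmpls, nfeats)

# Prediction: run both base models, concatenate, then apply the final model.
Y_pred = final.predict(np.hstack([base1.predict(X1), base2.predict(X2)]))
```

This works, but I would prefer a proper scikit-learn estimator (with fit/predict, cloning and grid-search support) rather than gluing the steps together by hand.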
In this specific example the number of input and output features is identical for the base models, but the rationale would be the same in a different setup. Also, the sets of input features are different for each base model, but this is a separate issue that I solved using: How to use different feature matrices for sklearn.ensemble.StackingClassifier (with class inheritance)?
I could also train the final model directly on the concatenated inputs (as opposed to the concatenated predictions), without stacking, but I would also like to have these two base models trained, as they are of interest on their own. The pipeline I described here would kill two birds with one stone.