
I want to do multi-output prediction of binary labels and continuous targets. My data consists of time series: one series of 10 time points of 30 observables per sample. Based on this, I want to predict 10 binary labels and 5 continuous targets.

For the sake of simplicity I have flattened the time series data, ending up with one row of 10 * 30 = 300 features per sample.
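For illustration, the flattening amounts to reshaping a (n_samples, 10, 30) array into (n_samples, 300); a minimal sketch with hypothetical sample counts:

import numpy as np

# hypothetical data: 100 samples, 10 time points, 30 observables each
n_samples, n_timepoints, n_observables = 100, 10, 30
series = np.random.rand(n_samples, n_timepoints, n_observables)

# flatten each sample's time series into a single row of 300 features
X = series.reshape(n_samples, n_timepoints * n_observables)
print(X.shape)  # (100, 300)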

Since there are many labels to predict about the same system, and since there exist relationships between them, I want to use multi-output prediction. My idea is to divide the task into two parts: MultiOutputClassifier for the binary labels and MultiOutputRegressor for the continuous ones, as sketched below.
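A minimal sketch of that two-part split, using the X_train/y_train split defined in the code further down; CLASSIFICATION_LABELS appears below, while REGRESSION_LABELS is a hypothetical list of the continuous column names:

from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from xgboost import XGBClassifier, XGBRegressor

# one wrapper per task type; each fits an independent XGBoost model per target
clf = MultiOutputClassifier(XGBClassifier())
reg = MultiOutputRegressor(XGBRegressor())

clf.fit(X_train, y_train[CLASSIFICATION_LABELS])
reg.fit(X_train, y_train[REGRESSION_LABELS])  # REGRESSION_LABELS: hypothetical name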

I generally like XGBoost and wish to use it for this task, but of course I want to prevent overfitting when doing so. So I have the code below, and I wish to pass early_stopping_rounds (together with an eval_set) to the fit method of the underlying XGBClassifier, but don't know how.
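For a single-label XGBClassifier on its own this is straightforward; a minimal sketch, assuming xgboost >= 1.6 (where early_stopping_rounds is a constructor argument) and a hypothetical single binary label column y_single:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(X, y_single, test_size=0.2)

# assuming xgboost >= 1.6: early_stopping_rounds goes in the constructor
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=30)

# the eval_set still has to go to fit; this is exactly what I don't know how to
# route through MultiOutputClassifier + Pipeline + GridSearchCV below
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])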

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)


pipeline = Pipeline([
    ('imputer', SimpleImputer()),  # XGBoost can deal with NaNs, but MultiOutputClassifier cannot
    ('classifier', MultiOutputClassifier(XGBClassifier())),
    ])


param_grid = dict(
    classifier__estimator__n_estimators=[100],  # this works
    # classifier__estimator__early_stopping_rounds=[30],  # needs to be passed to .fit
    # classifier__estimator__scale_pos_weight=[scale_pos_weight],  # XGBoostError: Invalid Parameter format for scale_pos_weight expect float
    )

clf = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', refit='roc_auc', cv=5, n_jobs=-1)
clf.fit(X_train, y_train[CLASSIFICATION_LABELS])

# predict_proba returns one (n_samples, 2) array per label; stacking gives (n_labels, n_samples, 2)
y_hat_proba = np.array(clf.predict_proba(X_test))
# column 1 is the positive-class probability; transpose to (n_samples, n_labels)
y_hat = pd.DataFrame(y_hat_proba[:, :, 1].T, columns=CLASSIFICATION_LABELS)

# roc_auc_score expects scores/probabilities, not thresholded hard labels
auc_roc_scores = np.array([roc_auc_score(y_test[label], y_hat[label]) for label in y_hat.columns])
print(f'average ROC AUC score: {np.mean(auc_roc_scores).round(3)}+/-{np.std(auc_roc_scores).round(3)}')


>>> average ROC AUC score: 0.499+/-0.002

I tried passing it to fit both as classifier__estimator__early_stopping_rounds=30 and as classifier__early_stopping_rounds=30, without success.

I get ROC AUC scores of 0.5 on the labels, which means the models are not learning anything; this is why I want to pass the early_stopping_rounds parameter and the eval_set. I suppose that being able to pass scale_pos_weight could also be useful, but it probably doesn't work for multi-output prediction, since each binary label would need its own weight while the wrapped XGBClassifier expects a single float. At the moment I get the feeling that this is not the way to solve this, and if you agree I would appreciate alternative suggestions.
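One alternative I have considered is to skip MultiOutputClassifier entirely and fit one XGBClassifier per label, which would let me pass a per-label eval_set and scale_pos_weight; a sketch of what I mean, again assuming xgboost >= 1.6 (constructor-level early_stopping_rounds) and that every label has at least one positive example:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# XGBoost handles NaNs natively, so no imputer is needed here
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train[CLASSIFICATION_LABELS], test_size=0.2)

models = {}
for label in CLASSIFICATION_LABELS:
    # per-label class balance as a single float: (#negatives / #positives);
    # assumes each label has at least one positive in the training split
    pos = y_tr[label].sum()
    spw = (len(y_tr) - pos) / pos

    model = XGBClassifier(
        n_estimators=1000,
        early_stopping_rounds=30,  # constructor argument in xgboost >= 1.6
        scale_pos_weight=spw,
        eval_metric='auc',
    )
    # eval_set is specific to this label, which MultiOutputClassifier cannot express
    model.fit(X_tr, y_tr[label], eval_set=[(X_val, y_val[label])], verbose=False)
    models[label] = model

This would lose the convenience of a single GridSearchCV over all labels, but each model would at least stop at its own best iteration.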
