
I'm trying to develop an XGBoost survival model. Here is a snippet of my code:

X = df_High_School[['Gender', 'Lived_both_Parents', 'Moth_Born_in_Canada', 'Father_Born_in_Canada', 'Born_in_Canada', 'Aboriginal', 'Visible_Minority']]  # covariates
y = df_High_School[['time_to_event', 'event']]  # time to event and event indicator

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Develop the model
model = xgb.XGBRegressor(objective='survival:cox')

# Fit the model to the training data
model.fit(X_train, y_train)

It's giving me the following error:


ValueError                                Traceback (most recent call last)
in
     18
     19 # fit the model to the training data
---> 20 model.fit(X_train, y_train)
     21
     22 # make predictions on the test set

2 frames
/usr/local/lib/python3.8/dist-packages/xgboost/core.py in _maybe_pandas_label(label)
    261     if isinstance(label, DataFrame):
    262         if len(label.columns) > 1:
--> 263             raise ValueError('DataFrame for label cannot have multiple columns')
    264
    265     label_dtypes = label.dtypes

ValueError: DataFrame for label cannot have multiple columns

As this is a survival model, I need two columns to indicate the event and the time_to_event. I also tried converting the DataFrames to NumPy arrays, but that didn't work either.

Any clue? Thanks!

  • You are mixing a classification and a regression (one predicting classes, the events, and one predicting a numerical value, the time). Is there a reason to believe that the event and the time to that event are related? If not, you can use XGBoost to develop two independent models on the same data, one predicting the event and one the time to event. They are implicitly connected through the shared data, but each outputs a separate prediction that you can combine. If the two outputs depend on each other, this approach might not capture that; you would need another (statistical) model for it. – Baradrist Jan 05 '23 at 09:55
  • I'm talking about survival analysis, which I believe is a bit different than classification or regression modeling. Survival analysis is a type of statistical analysis that is used to analyze time-to-event data, such as the time it takes for a patient to experience a certain event (e.g., death, relapse, etc.). It is different from regression and classification modeling in that it focuses specifically on predicting time-to-event data, rather than predicting a continuous or categorical outcome. Additionally, survival analysis takes into account that not all events may have occurred yet. – Mohamad Jan 06 '23 at 19:28
  • Yes, I am familiar with that. Nevertheless, your own definition of the data involves two kinds of prediction: a) a class (the categorical variable "event") and b) a numeric value (the time to event). You want to predict them jointly, so I was suggesting that your model could be a neural net that splits its last layer into two predictions (or two independent models), with the loss function adjusted to be a [combination](https://stats.stackexchange.com/questions/245902/is-there-any-algorithm-combining-classification-and-regression) of categorical and regression loss. – Baradrist Jan 09 '23 at 07:14
  • By the way, you tried a Cox regression, so you were not that far from what you wanted in the first place! The problem is that in Cox regression you implicitly assume the category already (e.g. survival) and only look for the time to that event. For example, you don't consider censored events (patients dropping out). For this, one further step is needed. It may actually turn out that you don't need a machine learning model in the end. Maybe a statistical model is even better suited! – Baradrist Jan 09 '23 at 07:25
  • Thanks for your elaboration! – Mohamad Jan 12 '23 at 17:54
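
A note on the label format behind the error in the question: as far as the XGBoost documentation describes it, the `survival:cox` objective takes a single label column in which right-censored observations are encoded as negative times, rather than a two-column (time, event) label. Below is a minimal sketch of that encoding. It assumes `event` is coded 1 for an observed event and 0 for censoring, and that all times are strictly positive. `to_cox_label` and `fit_cox` are hypothetical helper names used for illustration; they are not part of XGBoost.

```python
def to_cox_label(times, events):
    """Collapse (time, event) pairs into one signed label per row.

    Observed events keep their positive time; right-censored rows
    (event == 0, an assumed coding) get the negated time.
    """
    labels = []
    for t, e in zip(times, events):
        if t <= 0:
            raise ValueError("survival times must be strictly positive")
        labels.append(float(t) if e else -float(t))
    return labels


def fit_cox(X_train, y_train):
    """Fit a Cox model; y_train has 'time_to_event' and 'event' columns."""
    # Imported lazily so the label helper above works without these packages.
    import numpy as np
    import xgboost as xgb

    label = np.asarray(to_cox_label(y_train["time_to_event"], y_train["event"]))
    model = xgb.XGBRegressor(objective="survival:cox")
    model.fit(X_train, label)  # one label column instead of a two-column DataFrame
    return model


# Three subjects: events at t=5 and t=3, one subject censored at t=8.
print(to_cox_label([5, 8, 3], [1, 0, 1]))  # [5.0, -8.0, 3.0]
```

With the question's variables this would be called as `model = fit_cox(X_train, y_train)`. Note that, per the XGBoost documentation, predictions from this objective are on the hazard-ratio scale rather than predicted times, so they rank risk instead of estimating time to event directly.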

0 Answers