Unlike the example in the XGBoost documentation, my dataset contains many observations per patient: whenever a patient's health changed, new measurements were logged.

I'm trying to benefit from that abundance of data, so I fit an XGBoost model on all of it, without differentiating patients from each other.

So my dataset looks like this:

Patient   A   B   C   D   E   F   daysOfObserv   daysToEvent
x1        1   3   5   2   8   1   10             364
x1        17  9   2   4   23  1   20             211
x1        8   6   4   6   3   2   56             30
x2        3   5   5   4   13  66  13             121

I drop the Patient column. daysToEvent is used as the training target (as both the lower and the upper label bound) and is then also dropped before training.

import pandas as pd
import xgboost as xgb

df = pd.read_csv('data.csv')

# Lower and upper bounds are identical, i.e. every row is treated as uncensored.
y_train_lower_bound = df['daysToEvent'].values
y_train_upper_bound = df['daysToEvent'].values

# Drop the patient ID and the target so that only the features remain.
df = df.drop(['Patient', 'daysToEvent'], axis=1)

x_train = xgb.DMatrix(df)
x_train.set_float_info('label_lower_bound', y_train_lower_bound)
x_train.set_float_info('label_upper_bound', y_train_upper_bound)

params = {'objective': 'survival:aft',
          'eval_metric': 'aft-nloglik',
          'aft_loss_distribution': 'normal',
          'aft_loss_distribution_scale': 1.20,
          'tree_method': 'hist', 'learning_rate': 0.05, 'max_depth': 2}

# Train on the same DMatrix that is used for evaluation.
bst = xgb.train(params, x_train, num_boost_round=100,
                evals=[(x_train, 'train')])
Output:
[0] train-aft-nloglik:15.65987
[1] train-aft-nloglik:14.39717
[2] train-aft-nloglik:13.25554
[3] train-aft-nloglik:12.22325
[4] train-aft-nloglik:11.28966
[5] train-aft-nloglik:10.44522
[6] train-aft-nloglik:9.68131
[7] train-aft-nloglik:8.99034
[8] train-aft-nloglik:8.36482
[9] train-aft-nloglik:7.79868
[10]    train-aft-nloglik:7.28619
[11]    train-aft-nloglik:6.82221
[12]    train-aft-nloglik:6.40206
[13]    train-aft-nloglik:6.02054
[14]    train-aft-nloglik:5.67580
[15]    train-aft-nloglik:5.36259
[16]    train-aft-nloglik:5.07880
[17]    train-aft-nloglik:4.82224
[18]    train-aft-nloglik:4.58903
[19]    train-aft-nloglik:4.37760
[20]    train-aft-nloglik:4.18603
[21]    train-aft-nloglik:4.01213
[22]    train-aft-nloglik:3.85469
[23]    train-aft-nloglik:3.71190
[24]    train-aft-nloglik:3.58228
...
[96]    train-aft-nloglik:2.27551
[97]    train-aft-nloglik:2.27501
[98]    train-aft-nloglik:2.27399
[99]    train-aft-nloglik:2.27356

But later, at the prediction stage, I don't get satisfactory results, even on the training dataset. Is it because of my approach of fitting multiple rows for each patient? What would be the correct approach: only one observation per patient in XGBoost? Or do I need to look for another model that can fit multiple observations for one subject?

user164863

1 Answer

First, your data doesn't seem to meet the IID (independent and identically distributed) assumption that training relies on. In your dataset, this is because multiple (dependent) training records exist for each patient. So you could combine each patient's records into one row, with the data aggregated in some way (if that is possible for your data types). Another, possibly better and more powerful, alternative is to use a different model type: RNNs (recurrent neural networks), specifically LSTMs (long short-term memory networks). With these models you can feed multiple rows into the network, one after another, and then get your regression prediction. They are generally used for time series, which seems to be the case for your dataset. Hope this helps, and good luck (as my mathematics professor once said: machine learning is experimental mathematics)!
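
For the first option, here is a minimal sketch of what aggregating each patient into one row could look like. The toy frame is copied from the table in the question; the helper name aggregate_per_patient, the choice of mean/min/max, the n_obs column, and taking the target from the latest visit are all assumptions made for illustration, not something your data necessarily calls for.

```python
import pandas as pd

# Toy data copied from the table in the question.
df = pd.DataFrame({
    "Patient": ["x1", "x1", "x1", "x2"],
    "A": [1, 17, 8, 3],
    "B": [3, 9, 6, 5],
    "C": [5, 2, 4, 5],
    "D": [2, 4, 6, 4],
    "E": [8, 23, 3, 13],
    "F": [1, 1, 2, 66],
    "daysOfObserv": [10, 20, 56, 13],
    "daysToEvent": [364, 211, 30, 121],
})

feature_cols = ["A", "B", "C", "D", "E", "F"]


def aggregate_per_patient(frame):
    """Collapse all rows of a patient into one row (hypothetical helper)."""
    # Sort so that "last" below really means the latest visit.
    frame = frame.sort_values(["Patient", "daysOfObserv"])
    grouped = frame.groupby("Patient")

    # Summary statistics of every measurement over the patient's history.
    stats = grouped[feature_cols].agg(["mean", "min", "max"])
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]

    # Number of visits (assumed to be a useful extra feature).
    stats["n_obs"] = grouped.size()

    # Assumption: observation time and target are taken from the latest visit.
    last = grouped[["daysOfObserv", "daysToEvent"]].last()
    return stats.join(last)


per_patient = aggregate_per_patient(df)
print(per_patient)
```

The result has one row per patient, so the daysToEvent column can be passed as the label bounds exactly as in the question, and the remaining columns as features.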

Edit: To train an LSTM, you would need to group the rows by patient and feed each patient's whole time series (every single measurement point) into the network.
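
As a rough illustration of that grouping step, the sketch below turns the long table into one padded sequence per patient, i.e. the (patients, time steps, features) layout that recurrent layers typically take. The helper build_sequences, the zero padding, and the use of the last daysToEvent as the target are assumptions; the network itself is left out.

```python
import numpy as np
import pandas as pd

# Toy data copied from the table in the question.
df = pd.DataFrame({
    "Patient": ["x1", "x1", "x1", "x2"],
    "A": [1, 17, 8, 3],
    "B": [3, 9, 6, 5],
    "C": [5, 2, 4, 5],
    "D": [2, 4, 6, 4],
    "E": [8, 23, 3, 13],
    "F": [1, 1, 2, 66],
    "daysOfObserv": [10, 20, 56, 13],
    "daysToEvent": [364, 211, 30, 121],
})

feature_cols = ["A", "B", "C", "D", "E", "F", "daysOfObserv"]


def build_sequences(frame, pad_value=0.0):
    """Build one padded sequence per patient (hypothetical helper).

    Returns X with shape (n_patients, max_len, n_features), the true
    sequence lengths, one target per patient, and the patient IDs.
    """
    # Keep each patient's rows in chronological order.
    frame = frame.sort_values(["Patient", "daysOfObserv"])
    groups = list(frame.groupby("Patient", sort=True))
    max_len = max(len(g) for _, g in groups)

    X = np.full((len(groups), max_len, len(feature_cols)),
                pad_value, dtype=np.float32)
    lengths = np.zeros(len(groups), dtype=np.int64)
    y = np.zeros(len(groups), dtype=np.float32)

    for i, (_, g) in enumerate(groups):
        n = len(g)
        X[i, :n, :] = g[feature_cols].to_numpy(dtype=np.float32)
        lengths[i] = n
        # Assumption: the target is daysToEvent at the latest visit.
        y[i] = g["daysToEvent"].iloc[-1]

    patients = [pid for pid, _ in groups]
    return X, lengths, y, patients


X, lengths, y, patients = build_sequences(df)
print(X.shape, lengths, y, patients)
```

The lengths array is kept so that a model can mask or pack the padded time steps instead of treating the zeros as real measurements.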

Fatorice
  • Thank you for your detailed answer. If there are about 1,000 rows of observations for each patient, do you think taking a single random row (or, let's say, 10 rows) for each one would take care of the independence issue? – user164863 Oct 09 '22 at 11:08
  • The thing is: such rows will never be independent, because they all depend on the patient who generated the measurements. Maybe you can overcome this by shuffling your dataset and randomly generating train and test sets, but this is just a guess; shuffling and random splitting should always be done in machine learning anyway. – Fatorice Oct 09 '22 at 11:38