I have a highly imbalanced dataset and am wondering where to account for the class weights, so I am trying to understand the difference between the scale_pos_weight argument of XGBClassifier and the sample_weight parameter of its fit method. I would appreciate an intuitive explanation of the difference between the two, whether they can be used simultaneously, and how to choose between the two approaches.
The documentation describes scale_pos_weight as:

Control the balance of positive and negative weights. A typical value to consider: sum(negative cases) / sum(positive cases)
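That suggested ratio can be computed directly from the training labels, for instance like this (assuming y_train is a 0/1 array, as in the snippets below):

import numpy as np

# negative-to-positive ratio suggested by the docs
neg, pos = np.sum(y_train == 0), np.sum(y_train == 1)
ratio = neg / pos  # e.g. ~14 for a 14:1 imbalance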
Example:
from xgboost import XGBClassifier

LR = 0.1
NumTrees = 1000

# imbalance handled globally: every positive example is weighted
# 14x inside the loss (here assuming a 14:1 negative:positive ratio)
xgbmodel = XGBClassifier(booster='gbtree', seed=0, nthread=-1,
                         gamma=0, scale_pos_weight=14,
                         learning_rate=LR, n_estimators=NumTrees,
                         max_depth=5, objective='binary:logistic',
                         subsample=1)
xgbmodel.fit(X_train, y_train)
OR
from xgboost import XGBClassifier
import numpy as np

LR = 0.1
NumTrees = 1000

xgbmodel = XGBClassifier(booster='gbtree', seed=0, nthread=-1,
                         gamma=0, learning_rate=LR, n_estimators=NumTrees,
                         max_depth=5, objective='binary:logistic',
                         subsample=1)

# imbalance handled per sample: one weight per training row,
# e.g. up-weighting the positive (minority) class by the 14:1 ratio
weights_train = np.where(y_train == 1, 14.0, 1.0)
xgbmodel.fit(X_train, y_train, sample_weight=weights_train)
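For what it is worth, the per-sample weights above can also be derived from the data instead of hard-coding the ratio; a minimal sketch using scikit-learn's compute_sample_weight (my choice of helper, not something from the XGBoost docs):

from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' gives each class a total weight proportional to n_samples,
# so positives end up weighted neg/pos times heavier than negatives --
# the same relative ratio as scale_pos_weight above
weights_train = compute_sample_weight(class_weight='balanced', y=y_train)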