I am trying out the custom loss function for quantile regression with XGBoost from https://gist.github.com/Nikolay-Lysenko/06769d701c1d9c9acb9a66f2f9d7a6c7, which is as follows:
import numpy as np


def xgb_quantile_eval(preds, dmatrix, quantile=0.2):
    """
    Customized evaluational metric that equals
    to quantile regression loss (also known as
    pinball loss).
    Quantile regression is regression that
    estimates a specified quantile of target's
    distribution conditional on given features.
    @type preds: numpy.ndarray
    @type dmatrix: xgboost.DMatrix
    @type quantile: float
    @rtype: float
    """
    labels = dmatrix.get_label()
    return ('q{}_loss'.format(quantile),
            np.nanmean((preds >= labels) * (1 - quantile) * (preds - labels) +
                       (preds < labels) * quantile * (labels - preds)))


def xgb_quantile_obj(preds, dmatrix, quantile=0.2):
    """
    Computes first-order derivative of quantile
    regression loss and a non-degenerate
    substitute for second-order derivative.
    Substitute is returned instead of zeros,
    because XGBoost requires non-zero
    second-order derivatives. See this page:
    https://github.com/dmlc/xgboost/issues/1825
    to see why it is possible to use this trick.
    However, be sure that hyperparameter named
    `max_delta_step` is small enough to satisfy:
    ```0.5 * max_delta_step <=
       min(quantile, 1 - quantile)```.
    @type preds: numpy.ndarray
    @type dmatrix: xgboost.DMatrix
    @type quantile: float
    @rtype: tuple(numpy.ndarray)
    """
    try:
        assert 0 <= quantile <= 1
    except AssertionError:
        raise ValueError("Quantile value must be float between 0 and 1.")
    labels = dmatrix.get_label()
    errors = preds - labels
    left_mask = errors < 0
    right_mask = errors > 0
    grad = -quantile * left_mask + (1 - quantile) * right_mask
    hess = np.ones_like(preds)
    return grad, hess
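To make the metric concrete, here is a minimal NumPy-only sketch of the pinball loss on two hand-picked points. The helper name `pinball_loss` and the sample values are illustrative assumptions, not part of the gist; the formula mirrors the expression inside `xgb_quantile_eval`.

```python
import numpy as np


def pinball_loss(preds, labels, quantile=0.2):
    """Illustrative helper (not from the gist): mean pinball loss."""
    errors = preds - labels
    # Over-predictions are weighted by (1 - quantile),
    # under-predictions by quantile.
    per_point = np.where(errors >= 0,
                         (1 - quantile) * errors,
                         -quantile * errors)
    return np.nanmean(per_point)


# One over-prediction by 1 and one under-prediction by 1.
sample_preds = np.array([3.0, 1.0])
sample_labels = np.array([2.0, 2.0])
loss = pinball_loss(sample_preds, sample_labels, quantile=0.2)
print(loss)
```

With quantile 0.2, the over-prediction costs 0.8 and the under-prediction costs 0.2, so low quantiles push predictions downward.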
I have been getting errors when trying to fit the model, that is, when running xgb_r.fit(train_X, train_y).
If I assign the variables as follows:
X = df[['var1','var2', 'var3','var4','var5']]
I get this error:
AttributeError: 'numpy.ndarray' object has no attribute 'get_label'
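One possible reading of this error, assuming xgb_r is an xgboost.XGBRegressor built with the gist function as its objective (the construction of xgb_r is not shown here): the scikit-learn wrapper calls a custom objective as objective(y_true, y_pred) with two NumPy arrays, whereas the gist functions expect (preds, dmatrix) as used by the native xgb.train API. A hedged sketch of an adapted objective follows; the name `quantile_obj_sklearn` is hypothetical.

```python
import numpy as np


def quantile_obj_sklearn(y_true, y_pred, quantile=0.2):
    """Hypothetical adaptation of xgb_quantile_obj.

    Assumes the scikit-learn wrapper convention, where labels arrive
    first as a plain array, so there is no DMatrix and no get_label().
    """
    if not 0 <= quantile <= 1:
        raise ValueError("Quantile value must be float between 0 and 1.")
    errors = y_pred - y_true
    # Same gradient as the gist: -quantile below the label,
    # (1 - quantile) above it, and 0 exactly on it.
    grad = -quantile * (errors < 0) + (1 - quantile) * (errors > 0)
    # Constant substitute for the (zero) second derivative, as in the gist.
    hess = np.ones_like(y_pred)
    return grad, hess


# Quick check on plain arrays, no XGBoost needed.
y_true = np.array([2.0, 2.0, 2.0])
y_pred = np.array([1.0, 2.0, 3.0])
grad, hess = quantile_obj_sklearn(y_true, y_pred)
print(grad, hess)
```

Under that assumption, the function could be passed as objective=quantile_obj_sklearn when creating the regressor, with functools.partial used to fix a different quantile. The original (preds, dmatrix) versions would instead fit the native API, which works on an xgboost.DMatrix.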
If the variables are assigned like this:
X = pd.DataFrame(np.c_[df['var1'], df['var2'], df['var3'], df['var4'], df['var5']], columns=['var1','var2', 'var3','var4','var5'])
Then I get this:
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameter 'enable_categorical' must be set to 'True'. Invalid columns:var1: object, var2: object, var3: object, var4: object, var5: object
In any case, df.dtypes shows that all the variables I am using are either int64 or float64, so maybe yet another way of assigning the variables is needed. Any advice on how to fix this would be great.
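Since df.dtypes describes the original frame rather than the rebuilt one, a useful check is to inspect the dtypes of the exact object passed to fit. The sketch below uses a made-up two-column, object-dtype frame as a stand-in (the real data is not shown) and coerces it to numeric dtypes with pd.to_numeric. Whether this matches the actual cause here is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: a frame built from an object-dtype array,
# which reproduces the "object" columns named in the ValueError.
raw = np.array([[1, 2.5], [3, 4.5]], dtype=object)
X = pd.DataFrame(raw, columns=["var1", "var2"])
print(X.dtypes)  # inspect the frame that actually goes into fit()

# Coerce every column to a numeric dtype; values that cannot be
# parsed raise an error instead of silently staying as objects.
X_numeric = X.apply(pd.to_numeric)
print(X_numeric.dtypes)
```

If the rebuilt frame does show object columns, the same coercion (or X.astype("float64")) applied to train_X before fitting would be the first thing to try.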