I'm trying to run LightGBM with the 'map' metric (I'll explain why at the end of this post), using the following parameters dict (with the sklearn API):
    param = {
        'objective': 'binary',
        'num_threads': 40,
        'metric': 'map',
        'eval_at': 300,
        'feature_fraction': 1.0,
        'bagging_fraction': 1.0,
        'min_data_in_leaf': 50,
        'max_depth': -1,
        'subsample_for_bin': 200000,
        'subsample': 1.0,
        'subsample_freq': 0,
        'min_split_gain': 0.0,
        'min_child_weight': 0.001,
        'min_child_samples': 20,
        'n_estimators': 9999
    }
but I get the following error:
> [LightGBM] [Fatal] For MAP metric, there should be query information
> Traceback (most recent call last):
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-76-81403c753a65>", line 44, in <module>
>     eval_metric=param_1['metric'])
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/sklearn.py", line 539, in fit
>     callbacks=callbacks)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/sklearn.py", line 391, in fit
>     callbacks=callbacks)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 168, in train
>     booster = Booster(params=params, train_set=train_set)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1215, in __init__
>     ctypes.byref(self.handle)))
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 47, in _safe_call
>     raise LightGBMError(_LIB.LGBM_GetLastError())
> lightgbm.basic.LightGBMError: b'For MAP metric, there should be query information'
The only explanation I found for the query information concept is in the LightGBM parameters docs. This is what they say:
Query data
For LambdaRank learning, it needs query information for training data. LightGBM use an additional file to store query data. Following is an example:
27 18 67 ...
It means first 27 lines samples belong one query and next 18 lines belong to another, and so on.(Note: data should order by query) If name of data file is “train.txt”, the query file should be named as “train.txt.query” and in same folder of training data. LightGBM will load the query file automatically if it exists.
You can specific query/group id in data file now. Please refer to parameter group in above.
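To check my reading of that passage, here is a small sketch of how I understand the `27 18 67 ...` format: each number is the size of one query, and rows are assigned to queries in order. The helper names `query_boundaries` and `row_query_ids` are my own and are not part of LightGBM; this is only my interpretation of the docs, not a description of how the library actually stores the data.

```python
from itertools import accumulate


def query_boundaries(group_sizes):
    """Return (start, end) row ranges, end exclusive, one per query."""
    ends = list(accumulate(group_sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def row_query_ids(group_sizes):
    """Expand group sizes into one query index per row, in row order."""
    return [qid for qid, size in enumerate(group_sizes) for _ in range(size)]


# Group sizes taken from the docs example (the "..." part is left out).
group_sizes = [27, 18, 67]
boundaries = query_boundaries(group_sizes)
ids = row_query_ids(group_sizes)

for qid, (start, end) in enumerate(boundaries):
    print(f"query {qid}: rows {start}..{end - 1}")
```

If this reading is right, a dataset with no natural grouping would presumably be a single query containing all rows, but I'm not sure that is the intended use.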
I also looked into the LightGBM source code to see how it is used, but I still don't understand the query information concept. Could someone explain it?
The reason I'm trying to use the 'map' metric is that the goal of my classification model is to reach the highest PPV (positive predictive value, i.e. precision) in the top 10% of predicted risk. When I optimize for 'auc', any improvement in the ranking, whether in the top risk decile or elsewhere in the dataset, improves the AUC. I want the model to optimize only for PPV in the top 10%, because that is how it will be used in the real world (i.e. sending the top 10% highest-risk people to a certain medical treatment).
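To make the target concrete, this is roughly the quantity I care about, written as a plain-Python sketch. The function name `ppv_at_top_fraction` and the toy data are made up for illustration; it is not a LightGBM metric, just the number I would like the model to maximize.

```python
def ppv_at_top_fraction(y_true, y_score, fraction=0.10):
    """PPV (precision) among the `fraction` of samples with the highest scores."""
    n_top = max(1, round(len(y_score) * fraction))
    # Indices sorted by predicted risk, highest first.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top = order[:n_top]
    return sum(y_true[i] for i in top) / n_top


# Toy data: 20 samples with strictly decreasing scores, so the top 10%
# is the first two samples (labels 1 and 0).
y_score = [1.0 - 0.05 * i for i in range(20)]
y_true = [1, 0, 1] + [0] * 17

ppv = ppv_at_top_fraction(y_true, y_score, fraction=0.10)
print(ppv)
```

Changes in the ordering below the top decile leave this value untouched, which is exactly the property AUC does not have.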
Would love to get any help.
Thanks!