
I'm trying to train LightGBM with the 'map' metric (I'll explain why at the end of this post), using the following parameters dict (sklearn API):

param = {
    'objective': 'binary',
    'num_threads': 40,
    'metric': 'map',
    'eval_at': 300,
    'feature_fraction': 1.0,
    'bagging_fraction': 1.0,
    'min_data_in_leaf': 50,
    'max_depth': -1,
    'subsample_for_bin': 200000,
    'subsample': 1.0,
    'subsample_freq': 0,
    'min_split_gain': 0.0,
    'min_child_weight': 0.001,
    'min_child_samples': 20,
    'n_estimators': 9999
}

but I get the following error:

> [LightGBM] [Fatal] For MAP metric, there should be query information
> Traceback (most recent call last):
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-76-81403c753a65>", line 44, in <module>
>     eval_metric=param_1['metric'])
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/sklearn.py", line 539, in fit
>     callbacks=callbacks)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/sklearn.py", line 391, in fit
>     callbacks=callbacks)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 168, in train
>     booster = Booster(params=params, train_set=train_set)
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1215, in __init__
>     ctypes.byref(self.handle)))
>   File "/home/danri/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 47, in _safe_call
>     raise LightGBMError(_LIB.LGBM_GetLastError())
> lightgbm.basic.LightGBMError: b'For MAP metric, there should be query information'

The only explanation I found for the query information concept was in the LightGBM parameters docs, and this is the explanation:

Query data

For LambdaRank learning, it needs query information for the training data. LightGBM uses an additional file to store query data. The following is an example:

27 18 67 ...

It means the first 27 samples belong to one query, the next 18 to another, and so on. (Note: the data should be ordered by query.) If the data file is named "train.txt", the query file should be named "train.txt.query" and placed in the same folder as the training data. LightGBM will load the query file automatically if it exists.

You can also specify the query/group id in the data file. Please refer to the group parameter above.
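To make sure I understand the file format, this is how I would generate such a query file from per-row query ids (a sketch; the query ids here are made up, and "train.txt.query" is the filename from the docs above):

    import numpy as np

    # Hypothetical per-row query ids, already ordered by query as the docs require
    query_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

    # Collapse rows sharing a query id into group sizes (ids are sorted, so
    # unique counts match the contiguous groups)
    _, group_sizes = np.unique(query_ids, return_counts=True)

    # Write one group size per line: "3", "2", "4"
    with open('train.txt.query', 'w') as f:
        f.write('\n'.join(str(c) for c in group_sizes) + '\n')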

I also looked into the LightGBM code to find where it is used, but I still don't understand the query information concept. Could someone explain it?

The reason I'm trying to use the 'map' metric is that the purpose of my classification model is to reach the highest PPV in the top 10% risk. When I optimize for 'auc', any improvement in the ranking (in the top risk decile or anywhere else in the dataset) improves the AUC. I want the model to optimize only for improving the top 10% PPV, because this is how it will be used in the real world (i.e. sending the top 10% risk people to a certain medical treatment).

Would love to get any help.

Thanks!

falkir_motama

2 Answers


There are a few things:

  • metric is used for evaluation only, not for optimisation (other than post-fit choice of the best hyperparameters, or early stopping).
  • the "query" (or "group") is basically the way to tell the model how samples are grouped. For evaluation (if you only use the map metric and do not use a ranking loss function), one can provide the groups via the eval_group argument of the fit method, see here. This is a list of arrays: the list has the same length as eval_set, and the individual arrays contain the number of elements in each group. Thus the sum of the integers in an array should match the number of samples in the corresponding evaluation set. Note that for this grouping to work, the code assumes that groups come in sequence. For example, eval_group=[(2, 3)] means that the metric evaluation expects an evaluation sample of length 5 (= 2 + 3), where the first 2 elements belong to one group and the following 3 to another.
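For illustration, here is a small sketch (the function name and the query ids are made up) that converts per-row query ids into the group-size arrays eval_group expects, and checks the invariant described above:

    import numpy as np

    def to_group_sizes(query_ids):
        """Convert row-ordered query ids into an array of group sizes."""
        ids = np.asarray(query_ids)
        # Positions where the query id changes; groups must be contiguous
        boundaries = np.flatnonzero(ids[1:] != ids[:-1]) + 1
        return np.diff(np.concatenate(([0], boundaries, [len(ids)])))

    # Hypothetical eval set with 5 rows: 2 rows of query "a", 3 of query "b"
    eval_query_ids = ['a', 'a', 'b', 'b', 'b']
    eval_group = [to_group_sizes(eval_query_ids)]  # one array per eval_set entry

    # The sum of each array must match the size of the corresponding eval set
    assert eval_group[0].sum() == len(eval_query_ids)

The resulting list could then be passed as eval_group alongside eval_set.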
Mischa Lisovyi

Have you thought about using another metric? From what I understand of your problem, the lift at the first decile might be a good evaluation metric. It compares the ability of your model to find "risky people" to a random guess, among the n% highest probabilities.

In short, you take the top decile of samples ranked by probability (as predicted by your model, these are the "most risky people"), count the number of actual "1" in them, and divide it by the number of actual "1" given by random predictions.

Here is how you can implement it for LightGBM. You will need to pass it as the feval param (of lightgbm.train; in the sklearn API a callable can be passed as eval_metric). It will not be used for optimization, only for evaluation (and early stopping). This code doesn't handle the case of equal predictions.

import numpy as np
import pandas as pd

def f_eval_lift(pred, train_data, centile=10):
    # Collect true labels and predicted probabilities side by side
    df = pd.DataFrame({'true': train_data.get_label(), 'pred': pred})
    # Number of samples in the top `centile` percent
    centile_num = int(np.ceil(centile / 100 * df.shape[0]))

    # Total number of positives in the whole set
    num_1 = int(df['true'].sum())
    # Keep only the top `centile` percent of samples by predicted probability
    df = df.nlargest(centile_num, columns='pred', keep='last')
    # TODO : handle the case of equal predictions

    # Positives found in the top centile vs. positives expected at random
    lift_value = df['true'].sum() / (centile / 100 * num_1)
    return 'lift_' + str(centile), lift_value, True
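As a quick sanity check of the metric (repeating the function so the snippet is self-contained, and using a minimal made-up stand-in for the lightgbm Dataset, since f_eval_lift only calls get_label on it):

    import numpy as np
    import pandas as pd

    def f_eval_lift(pred, train_data, centile=10):
        df = pd.DataFrame({'true': train_data.get_label(), 'pred': pred})
        centile_num = int(np.ceil(centile / 100 * df.shape[0]))
        num_1 = int(df['true'].sum())
        df = df.nlargest(centile_num, columns='pred', keep='last')
        lift_value = df['true'].sum() / (centile / 100 * num_1)
        return 'lift_' + str(centile), lift_value, True

    class FakeDataset:
        """Minimal stand-in exposing only what f_eval_lift uses."""
        def __init__(self, labels):
            self._labels = np.asarray(labels)
        def get_label(self):
            return self._labels

    # 20 samples, 4 positives, and a model that ranks the positives first
    labels = np.array([1, 1, 1, 1] + [0] * 16)
    preds = np.linspace(1.0, 0.05, 20)  # strictly decreasing scores

    name, lift, higher_is_better = f_eval_lift(preds, FakeDataset(labels))
    # Top 10% = 2 samples, both positive; random would find 0.1 * 4 = 0.4,
    # so the lift is 2 / 0.4 = 5.0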
Florian Mutel