I'm trying to set up learning to rank with LightGBM. I have the following dataset with the interactions of the users based on the query:

df = pd.DataFrame({'QueryID': [1, 1, 1, 2, 2, 2], 
                   'ItemID': [1, 2, 3, 1, 2, 3], 
                   'Position': [1, 2 , 3, 1, 2, 3], 
                   'Interaction': ['CLICK', 'VIEW', 'BOOK', 'BOOK', 'CLICK', 'VIEW']})

The question is how to properly set up the dataset for training. The docs mention using `Dataset.set_group()`, but it's not very clear how.
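
For illustration, here is one way these interactions could be turned into graded relevance labels and per-query group sizes; the VIEW/CLICK/BOOK grading below is an assumed mapping, not something LightGBM prescribes:

import pandas as pd

df = pd.DataFrame({'QueryID': [1, 1, 1, 2, 2, 2],
                   'ItemID': [1, 2, 3, 1, 2, 3],
                   'Position': [1, 2, 3, 1, 2, 3],
                   'Interaction': ['CLICK', 'VIEW', 'BOOK', 'BOOK', 'CLICK', 'VIEW']})

# Assumed grading: a booking counts for more than a click, a click for more than a view.
df['relevance'] = df['Interaction'].map({'VIEW': 0, 'CLICK': 1, 'BOOK': 2})

# One entry per query: the number of rows that belong to it.
group_sizes = df.groupby('QueryID').size().to_numpy()  # array([3, 3])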

Franco Piccolo
  • Hi, can you maybe make your question a bit clearer? Is `Position` your target? Or are you trying to get a solution like, e.g., what Amazon uses to propose interesting stuff? That would be more like collaborative filtering. – jottbe Oct 18 '20 at 12:37

2 Answers

I gave this example as an answer to another question; even though it does not specifically address the original question, I hope it can still be useful!

Here is how I used LightGBM LambdaRank.

First, we import some libraries and define our dataset:

import numpy as np
import pandas as pd
import lightgbm

# 100 queries of 10 rows each: 1000 rows in total.
df = pd.DataFrame({
    "query_id": [i for i in range(100) for j in range(10)],
    "var1": np.random.random(size=(1000,)),
    "var2": np.random.random(size=(1000,)),
    "var3": np.random.random(size=(1000,)),
    # Two relevant rows (label 1) per query; one permutation, repeated for all 100 queries.
    "relevance": list(np.random.permutation([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])) * 100,
})

Here is the dataframe:

     query_id      var1      var2      var3  relevance
0           0  0.624776  0.191463  0.598358          0
1           0  0.258280  0.658307  0.148386          0
2           0  0.893683  0.059482  0.340426          0
3           0  0.879514  0.526022  0.712648          1
4           0  0.188580  0.279471  0.062942          0
..        ...       ...       ...       ...        ...
995        99  0.509672  0.552873  0.166913          0
996        99  0.244307  0.356738  0.925570          0
997        99  0.827925  0.827747  0.695029          1
998        99  0.476761  0.390823  0.670150          0
999        99  0.241392  0.944994  0.671594          0

[1000 rows x 5 columns]

The structure of this dataset is important. In learning-to-rank tasks, you typically work with a set of queries. Here I define a dataset of 1000 rows, with 100 queries of 10 rows each. The queries could also be of variable length.
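
To make the variable-length case concrete, here is a hypothetical toy example (separate from the dataset above) of how the group sizes would come out:

# Hypothetical: three queries of 2, 4, and 3 rows respectively.
df_var = pd.DataFrame({"query_id": [0, 0, 1, 1, 1, 1, 2, 2, 2]})
group_sizes = df_var.groupby("query_id").size().to_numpy()
print(group_sizes)  # [2 4 3]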

Now for each query we have some variables, and we also get a relevance label. I used the numbers 0 and 1 here, so the task is basically: for each query (set of 10 rows), create a model that assigns higher relevance to the 2 rows that have a 1 for relevance.

Anyway, we continue with the setup for LightGBM. I split the dataset into a training set and a validation set, but you can split however you want. I would recommend using at least one validation set during training.

train_df = df[:800]  # first 80% of rows, i.e. the first 80 queries
validation_df = df[800:]  # remaining 20%, i.e. the last 20 queries

# Group sizes: one entry per query, counting the rows that belong to it.
qids_train = train_df.groupby("query_id")["query_id"].count().to_numpy()
X_train = train_df.drop(["query_id", "relevance"], axis=1)
y_train = train_df["relevance"]

qids_validation = validation_df.groupby("query_id")["query_id"].count().to_numpy()
X_validation = validation_df.drop(["query_id", "relevance"], axis=1)
y_validation = validation_df["relevance"]

Now this is probably the part you were stuck at. We create these 3 vectors/matrices for each dataframe. X_train is the collection of your independent variables, the input data for your model. y_train is your dependent variable, what you are trying to predict/rank. Lastly, qids_train holds your query group sizes: one entry per query, giving the number of rows that belong to it. They look like this:

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
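
A quick sanity check (my addition, not part of the original answer): LightGBM requires these group sizes to sum to the total number of rows.

assert qids_train.sum() == len(X_train)  # 80 queries x 10 rows = 800
assert qids_validation.sum() == len(X_validation)  # 20 queries x 10 rows = 200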

Also this is X_train:

         var1      var2      var3
0    0.624776  0.191463  0.598358
1    0.258280  0.658307  0.148386
2    0.893683  0.059482  0.340426
3    0.879514  0.526022  0.712648
4    0.188580  0.279471  0.062942
..        ...       ...       ...
795  0.014315  0.302233  0.255395
796  0.247962  0.871073  0.838955
797  0.605306  0.396659  0.940086
798  0.904734  0.623580  0.577026
799  0.745451  0.951092  0.861373

[800 rows x 3 columns]

and this is y_train:

0      0
1      0
2      0
3      1
4      0
      ..
795    0
796    0
797    1
798    0
799    0
Name: relevance, Length: 800, dtype: int64

Note that X_train is a pandas DataFrame and y_train a pandas Series; LightGBM supports both, and NumPy arrays would also work.

As you can see, the numbers in qids_train indicate the length of each query. If your queries were of variable length, the numbers in this array would differ accordingly. In my example, all queries are the same length.

We do the exact same thing for the validation set, and then we are ready to set up and train the LightGBM model. I use the scikit-learn API since I am most familiar with it.

model = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
)

I only use the bare minimum of parameters here. Feel free to take a look at the LightGBM documentation and use more parameters; it is a very powerful library. To start the training process, we call the fit function on the model. Here we specify that we want NDCG@10, and that the function should print the results every 10th iteration.

model.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
    eval_set=[(X_validation, y_validation)],
    eval_group=[qids_validation],
    eval_at=10,
    verbose=10,
)

which starts the training and prints:

[10]    valid_0's ndcg@10: 0.562929
[20]    valid_0's ndcg@10: 0.55375
[30]    valid_0's ndcg@10: 0.538355
[40]    valid_0's ndcg@10: 0.548532
[50]    valid_0's ndcg@10: 0.549039
[60]    valid_0's ndcg@10: 0.546288
[70]    valid_0's ndcg@10: 0.547836
[80]    valid_0's ndcg@10: 0.552541
[90]    valid_0's ndcg@10: 0.551994
[100]   valid_0's ndcg@10: 0.542401
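
One version note from my side: in newer LightGBM releases (4.x), the verbose argument was removed from fit, and per-iteration logging is configured with a callback instead. The equivalent call on a recent version looks roughly like this:

model.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
    eval_set=[(X_validation, y_validation)],
    eval_group=[qids_validation],
    eval_at=[10],
    callbacks=[lightgbm.log_evaluation(period=10)],  # replaces verbose=10
)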

I hope I could sufficiently illustrate the process with this simple example. Let me know if you have any questions left.
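
Since the question specifically mentions Dataset.set_group(), here is a rough equivalent of the setup above using the native (non-sklearn) API; this is my sketch, reusing the same X/y/qids variables:

# Native API: attach the group sizes to the Dataset directly.
train_data = lightgbm.Dataset(X_train, label=y_train)
train_data.set_group(qids_train)  # or pass group=qids_train to the constructor

valid_data = lightgbm.Dataset(X_validation, label=y_validation, reference=train_data)
valid_data.set_group(qids_validation)

params = {"objective": "lambdarank", "metric": "ndcg", "ndcg_eval_at": [10]}
booster = lightgbm.train(params, train_data, valid_sets=[valid_data])

In both APIs, prediction returns one score per row; you then sort the rows of each query by that score to get the ranking.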

charelf
  • You have assigned the value '1' to two items while preparing labels. Does it mean that both items with label '1' are equally relevant to the query? In that case, I believe ranking either of them at the top followed by the other will give the same NDCG score. Am I thinking in the right direction? – Bhargav Upadhyay Aug 30 '21 at 06:41
  • Yes, this is also the way I understood and implemented it. If one is more important, you can give it the value 2 and LGBM will try to rank it higher. Let me know if you have any other questions! – charelf Aug 31 '21 at 07:09
  • How to rank when we have query_id and document_id? If I replace `query_id` with `[query_id, document_id]` in your code, I get 1 printed as the score for the 10 groups after `.fit()`, like `[10] valid_0's ndcg@10: 1` – shaik moeed Sep 29 '21 at 12:20
  • Sorry, I don't think I understand your question. What is ```document_id```? – charelf Sep 29 '21 at 17:21
  • Can I ask for a practical example of how I could modify your idea, and what does query_id mean? I'm struggling to understand how I can apply LightGBM to my example, where I have: 1. user_id, 2. item_id (or, let's say, document_id like the user above asked), 3. multiple cosine similarities from various models. Let's say I let the users decide the relevancy of search results (by click-through, a button, whatever). What options do I have now? What can my query_id be? – Banik Jun 06 '22 at 18:43
  • Do I need to average out the similarities and put all documents on the X axis, or can I somehow use the similarities for the X axis too? – Banik Jun 06 '22 at 18:55
  • I think query_id is the same as what you call user_id (assuming your task is that a user searches something, the search gives you a few items (each having an item_id), and you want to rank these items?) – charelf Jun 06 '22 at 19:57
  • I don't know exactly what cosine similarities are, but basically the way this example works is that the user searches for something, and all of the items have different features; e.g. in your case, all the items have a different cosine similarity. If that is your situation, then you put into X all of the things that you know, i.e. all your data, and in y you give the relevance of the items, e.g. the item that is most relevant to the user should have the highest y-value among all items associated with a given user_id – charelf Jun 06 '22 at 20:00
  • @charelf Thank you for your answer. This sounds good. Gonna try to experiment with it tomorrow. I also had the idea that query_id could perhaps be a rank number for me (position in search results / "k"). Gonna experiment with it. I guess I need to go into the unknown a little bit and see what will happen. – Banik Jun 06 '22 at 20:40
  • @charelf And to clarify myself: I have N results sorted by cosine similarity of vectors. Those vectors are made by transferring text into vectors (bag-of-words etc.) by different methods (Tf-Idf, Doc2Vec etc.). Each of those methods gives a different cosine similarity and a different result. Some of them are better for one item (document) but worse for another, thus I would like to (ideally) boost results according to users' relevance choices and kind of blend the various models together. I hope it will go well. – Banik Jun 06 '22 at 20:47
  • Yes, in that case, what you want to do is: for each user_id, rank the items such that those items which are most relevant to this particular user (based on their preferences) have the highest value in the y vector. – charelf Jun 07 '22 at 09:44
  • ~How come the metric of your choice (NDCG@10) is growing worse during training?~ You're using random independent data, so that's overfitting, no problem. – Grisha Feb 22 '23 at 12:23
  • Since I am using random data, I think the output is just random nonsense, so in this instance it's getting worse by chance, but I think (not sure) that if you reran it, it could also get better. – charelf Feb 24 '23 at 08:25

Before converting this data to groups, you have to create a score variable (the dependent variable) and then generate train and test files. On top of that, you need to create two group files, one for train and one for test, which count how many times the same qid (QueryID) occurs.
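
A minimal sketch of what building such a group file could look like; the train.txt.query file name follows LightGBM's text-file convention of storing one count per line next to the data file, but treat the details as an assumption:

import pandas as pd

# For each QueryID, count its rows and write one count per line.
train = pd.DataFrame({"QueryID": [1, 1, 1, 2, 2, 2]})
group_sizes = train.groupby("QueryID", sort=False).size()
group_sizes.to_csv("train.txt.query", index=False, header=False)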

Go through this article for further reference: https://medium.com/@tacucumides/learning-to-rank-with-lightgbm-code-example-in-python-843bd7b44574

Rajan Garg
  • Hi! Unfortunately, the article you shared doesn't provide more information on what the dataset should look like. – LKho Mar 08 '21 at 15:27