
I'm currently trying to build a recommendation engine using the Python LightFM library. My input is a sparse matrix with shape (n_users, n_items), where each cell holds the number of interactions a user has had with a particular item. This is quite different from most of the examples I've seen, where the matrix is usually boolean (1 or 0) or uses a small scale (e.g. a rating of 1-5), and I'm not sure whether this could be a contributing factor to the problem I am facing.
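For context, here's roughly how the model is set up (the random data, sizes, and variable names below are placeholders for illustration, not my actual dataset):

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Hypothetical data: 5000 (user, item, count) triples for illustration only.
rng = np.random.default_rng(0)
n_users, n_items = 1000, 500
rows = rng.integers(0, n_users, size=5000)
cols = rng.integers(0, n_items, size=5000)
counts = rng.integers(1, 50, size=5000)  # raw interaction counts, not 0/1 or 1-5

# Cells hold the number of interactions, so values can be large.
interactions = coo_matrix(
    (counts.astype(np.float32), (rows, cols)), shape=(n_users, n_items)
)

model = LightFM(loss='warp')
model.fit(interactions, epochs=30, num_threads=4)
```

With that setup, I'm seeing the following scores: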

Training AUC score: 0.892
Testing AUC score: 0.873

At K=10:
Training precision_at_k: 0.0363
Testing precision_at_k: 0.0363
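For reference, the numbers above come from LightFM's built-in evaluation functions, roughly like this (continuing from the sketch above, with `interactions` as the count matrix; the split itself is illustrative):

```python
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k

# `interactions` is the count-valued matrix from the sketch above.
train, test = random_train_test_split(interactions, test_percentage=0.2)

model = LightFM(loss='warp')
model.fit(train, epochs=30, num_threads=4)

# Both metrics are computed per user and then averaged.
print("Train AUC:", auc_score(model, train).mean())
print("Test AUC:", auc_score(model, test, train_interactions=train).mean())
print("Train P@10:", precision_at_k(model, train, k=10).mean())
print("Test P@10:",
      precision_at_k(model, test, train_interactions=train, k=10).mean())
```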

I'm very puzzled as to why the AUC score seems to indicate that the engine is performing well, yet the precision at K indicates otherwise.

My understanding of the AUC score is that it is better suited to binary classification tasks. Could this be why the score appears so high? Presumably it treats every positive count as a boolean true and every 0 as a negative, and in my case the 0 values greatly outnumber the positive values, given the sparsity of the matrix.

I have relatively little experience with precision at k, and only know that it represents a precision averaged over the top-k recommendations per user. What does this low score mean when we consider it alongside the high AUC score?
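To make the gap between the two metrics concrete, here is a toy example (plain numpy/scikit-learn arithmetic, not LightFM internals): a user with 5 relevant items out of 1000, whose relevant items rank around positions 20-60. The pairwise ranking is excellent, so AUC is high, but none of them crack the top 10, so precision at 10 is zero.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

n_items = 1000
scores = -np.arange(n_items, dtype=float)  # item 0 scored highest, item 999 lowest
labels = np.zeros(n_items)
labels[[20, 30, 40, 50, 60]] = 1  # relevant items sit around ranks 20-60

# AUC is the chance a random positive outscores a random negative: ~0.96 here.
print("AUC:", roc_auc_score(labels, scores))

top_10 = np.argsort(-scores)[:10]  # indices of the 10 highest-scored items
print("precision@10:", labels[top_10].mean())  # 0.0 -- no relevant item in top 10
```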

EDIT:

So I'm still not entirely sure about this, but I thought I'd share some intuition I've gathered in case someone else encounters the same issue. One possible reason the recommendation engine scores low on precision_at_k is that the metric depends on which items make it into the top k of the ranking. The model can assign high scores to items the user will like overall (which is what AUC rewards), but if those items don't land inside the top k recommendations, precision_at_k will be very poor. Note that the order of items within the top k doesn't matter to this metric; what matters is whether the relevant items make the cut at all.
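A quick sanity check of that last point, using hypothetical item IDs: precision_at_k only counts how many of the top-k items are relevant, so reordering items within the top k doesn't change it.

```python
relevant = {3, 17, 42}  # hypothetical items the user actually interacted with

def precision_at_k(recs, relevant, k=10):
    # Fraction of the top-k recommendations that are relevant.
    return sum(1 for item in recs[:k] if item in relevant) / k

top_10    = [42, 7, 3, 99, 17, 5, 1, 88, 23, 60]
reordered = [99, 17, 5, 42, 1, 88, 3, 23, 60, 7]  # same items, different order

print(precision_at_k(top_10, relevant))     # 0.3
print(precision_at_k(reordered, relevant))  # 0.3 -- order within top k is irrelevant
```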


1 Answer


If you calculate precision at k for a user, and k is higher than the number of items that user has in the test set, you cannot get 100% precision.

When I compute precision at k, I generate 10 recommendations for each user in the test set, then check what percentage of those recommendations appear in the user's test-set data. If the user only has 2 items in that data, then at most 2 of my recommendations can be correct; I cannot get more than 2/10. So my precision at k is capped at 0.2 even if I correctly recommend both of the items in the test data.
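In code, that upper bound looks like this (a sketch of the arithmetic, not LightFM's implementation):

```python
def max_precision_at_k(n_test_items, k=10):
    # Even a perfect model can place at most min(n_test_items, k) hits
    # among its k recommendations.
    return min(n_test_items, k) / k

print(max_precision_at_k(2))   # 0.2 -- the user described above
print(max_precision_at_k(15))  # 1.0 -- only reachable with at least k test items
```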