How to evaluate predictions from incomplete data, where not all data is incomplete

Question

I am using Non-negative Matrix Factorization and Non-negative Least Squares for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example the original Data was

original = [1, 1, 0, 1, 1, 0]

And now I want to see how good I can reconstruct the original data when the given data is incomplete:

incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]

And I want to do this for every example in a big dataset. Now the problem is, the original data varies in the amount of positive data, in the original above there are 4, but for other examples in the dataset it could be more or less. Let´s say I make an evaluation round with 4 positives given, but half of my dataset only has 4 positives, the other half has 5,6 or 7. Should I exclude the half with 4 positives, because they have no data missing which makes the "prediction" much better? On the other side I would change the trainingset if I excluded data. What can I do? Or shouldn´t I evaluate with 4 at all in this case?

EDIT:

Basically I want to see how good I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. And then I want to know how good I can predict each user, based on just 1 movie that the user acually watched. I get a prediction for lots of movies. Then I plot a ROC and Precision-Recall curve (using top-k of the prediction). And I will repeat all of this with n movies that the users actually watched. I will get a ROC curve in my plot for every n. When I come to the point where I use e.g. 4 movies that the user actually watched, to predict all movies he watched, but he only watched those 4, the results get too good.

The reason why I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it would return only good results when there are already 3 movies watched, It would not be so good in my application.

score 1 · Accepted Answer · answered Sep 01 '12 at 14:17

I think it's first important to be clear what you are trying to measure, and what your input is.

Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.

FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.

Is "original" here an example of one row, perhaps for one user, in your input matrix? When you talk about half, and excluding, what training/test split are you referring to? splitting each user, or taking a subset across users? Because you seem to be talking about measuring reconstruction error, but that doesn't require excluding anything. You just multiply your matrix factors back together and see how close they are to the input. "Close" means low L2 / Frobenius norm.

But for convention recommender tests (like AUC or precision recall), which are something else entirely, you would either split your data into test/training by time (recent data is the test data) or value (most-preferred or associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You wouldn't ever have a situation where the test data were all the 0s, because they're not input to begin with. The question is, which 1s are for training and which 1s are for testing.

Hi, yes I want to see how good the reconstruction is, I am going to put my answer in my post as an edit. — Puckl, Sep 01 '12 at 14:41
You don't want to measure the reconstruction error, I think. What you need is exactly what you say above in your edit -- something like a ROC curve (this is what AUC summarizes). To answer your question, just run the test as usual. But, calculate AUC separately for users who happened to have just 1 item in the train set, 2 items, 3 items, etc. You should get a clear graph of how performance changes with ratings. — Sean Owen, Sep 01 '12 at 16:20

How to evaluate predictions from incomplete data, where not all data is incomplete

1 Answers1