0

The problem is:

A set of 5 independent users were asked to rate 50 products given to them. All 50 products would have been used by the users at some point in time. Some users are biased towards certain products. One user did not genuinely complete the survey and gave random values. The users are not required to rate all the products. Now, given a 4 sample dataset, rank the products based on the ratings.

Dataset:
product  #user1  #user2  #user3  #user4  #user5
0        29      -       10      90      12
1        -       -       -       -       7
2        -       -       95      6       1
3        -       -       -       -       2
4        -       -       -       -       50
5        -       35      21      13      -
6        -       -       -       -       5
7        4       -       -       30      -
8        11      -       -       -       14
 .
 .
 .

How can I come up with a ranking for the products?

This is a remodeled problem very close to the original problem.

Solution: I tried to clean the data, fill in the missing values using PCA, and then apply NMF, but I'm not sure about this approach.

Any help will be deeply appreciated.

Sandesh
Vinith
  • Did you try simple ideas first to see how they behave? For example, take the mean over all users for each product (leaving out the missing values), or fill the missing values with each subject's mean rating and then take the mean over all users for each product (using both real and imputed values). – ASantosRibeiro Nov 14 '14 at 09:26
  • @ASantosRibeiro: Thank you! I didn't try that. One user gave random ratings, so I assumed that taking the average would not give good results. – Vinith Nov 14 '14 at 10:20
  • Random ratings are no more than noise in your system, so if you have enough subjects they should not be a problem. Further, if you know which subjects rated randomly, exclude them from the study, as their contribution will only make your results worse. – ASantosRibeiro Nov 14 '14 at 10:22
  • @ASantosRibeiro: I don't know which user gave random ratings. Is there a method to detect such outliers, even when many values are missing? – Vinith Nov 14 '14 at 10:25
  • You can compute the mean for each product as explained above, take the distance between each subject's rating and that mean, sum the errors over all products, and look for the subjects with the highest errors. – ASantosRibeiro Nov 14 '14 at 10:31
  • @ASantosRibeiro: Thank you! I tried the method described above, and it gave some good results. – Vinith Nov 14 '14 at 20:10
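
The mean-distance check suggested in the comments can be sketched roughly as follows. This is only an illustration on the nine sample rows from the question, not the asker's actual code. Two choices here are assumptions: products rated by a single user are skipped (their mean equals the only rating, so they carry no signal), and the errors are averaged rather than summed so that users who rated more products are not penalized for it.

```python
# Ratings from the question's sample rows; None marks a missing rating.
# Each list holds the ratings of user1..user5 for one product.
ratings = {
    0: [29, None, 10, 90, 12],
    1: [None, None, None, None, 7],
    2: [None, None, 95, 6, 1],
    3: [None, None, None, None, 2],
    4: [None, None, None, None, 50],
    5: [None, 35, 21, 13, None],
    6: [None, None, None, None, 5],
    7: [4, None, None, 30, None],
    8: [11, None, None, None, 14],
}


def product_means(ratings):
    """Mean rating per product over the users who rated it."""
    means = {}
    for product, row in ratings.items():
        observed = [r for r in row if r is not None]
        # Assumption: skip products with a single rating (nothing to compare against).
        if len(observed) >= 2:
            means[product] = sum(observed) / len(observed)
    return means


def user_deviation(ratings, means, n_users):
    """Mean absolute distance between each user's ratings and the product means."""
    totals = [0.0] * n_users
    counts = [0] * n_users
    for product, mean in means.items():
        for user, r in enumerate(ratings[product]):
            if r is not None:
                totals[user] += abs(r - mean)
                counts[user] += 1
    # None for users with no comparable ratings.
    return [t / c if c else None for t, c in zip(totals, counts)]


means = product_means(ratings)
deviation = user_deviation(ratings, means, n_users=5)
for user, d in enumerate(deviation, start=1):
    print(f"user{user}: {d}")
```

With so few rows the scores are not meaningful on their own. A user with a strong but consistent bias will also show a large deviation, so a high score is a reason to inspect that user, not proof of random answers.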

2 Answers

3

If you don't care about the absolute scores and are mostly interested in a consistent relative ranking, you can view your problem as an instance of the rank aggregation problem: given a list of (partial or total) rankings, derive a consensus ranking that minimizes the total disagreement with the input rankings. There are several possible ways to formalize disagreement, and to postulate reasonable conditions that should hold. One example of such a condition is the Condorcet criterion: If an item defeats every other item in simple pairwise majority voting, then it should rank first.
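
As a small illustration of the Condorcet criterion, the sketch below looks for an item that wins every pairwise majority contest. The function names and the example rankings are hypothetical. Rankings are lists ordered best first and may be partial; a user only votes on a pair if both items appear in that user's list.

```python
def prefers(ranking, a, b):
    """True if the ranking (best first) contains both items and places a before b."""
    return a in ranking and b in ranking and ranking.index(a) < ranking.index(b)


def condorcet_winner(rankings):
    """Return the item that strictly beats every other item pairwise, or None."""
    items = sorted({item for ranking in rankings for item in ranking})
    for candidate in items:
        beats_all = all(
            sum(prefers(r, candidate, other) for r in rankings)
            > sum(prefers(r, other, candidate) for r in rankings)
            for other in items
            if other != candidate
        )
        if beats_all:
            return candidate
    return None  # no Condorcet winner, e.g. a preference cycle or an uncompared pair


# Hypothetical example: A beats B 2-1 and beats C 3-0.
print(condorcet_winner([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]))
```

A Condorcet winner does not always exist. With many missing ratings, some pairs are never compared by any user, and the strict test above then returns None.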

This excellent paper contains a good motivation and literature review of consensus ranking approaches. Kemeny optimal aggregation minimizes the Kendall tau distance to the input lists, i.e., the total count of pairwise disagreements between the consensus ranking and the lists. While computing this optimal aggregation is NP-hard, the authors propose reasonable heuristic approaches.
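
To make the Kemeny objective concrete, here is a brute-force sketch that tries every ordering and keeps one with the fewest pairwise disagreements. It is not one of the heuristics from the paper, and it is only feasible for a handful of items, since the number of orderings grows factorially. The data are the question's sample rows for the products rated by more than one user; turning ratings into pairwise preferences (higher rating wins, equal ratings count for neither side) is an assumption.

```python
from itertools import combinations, permutations

# Sample rows from the question (ratings of user1..user5), None = missing.
sample = {
    0: [29, None, 10, 90, 12],
    2: [None, None, 95, 6, 1],
    5: [None, 35, 21, 13, None],
    7: [4, None, None, 30, None],
    8: [11, None, None, None, 14],
}


def pairwise_wins(ratings):
    """wins[(a, b)] = number of users who rated both items and rated a strictly higher."""
    wins = {}
    for a, b in permutations(ratings, 2):
        wins[(a, b)] = sum(
            1
            for ra, rb in zip(ratings[a], ratings[b])
            if ra is not None and rb is not None and ra > rb
        )
    return wins


def kemeny_cost(order, wins):
    """Count user preferences that contradict `order` (listed best first)."""
    # combinations() keeps positional order, so `earlier` is ranked above `later`.
    return sum(wins[(later, earlier)] for earlier, later in combinations(order, 2))


def kemeny_ranking(ratings):
    """Exhaustive search for an ordering with minimal disagreement."""
    wins = pairwise_wins(ratings)
    return min(permutations(ratings), key=lambda order: kemeny_cost(order, wins))


best = kemeny_ranking(sample)
print(best, kemeny_cost(best, pairwise_wins(sample)))
```

Several orderings can share the minimal cost; `min` simply returns the first one it encounters. For all 50 products exhaustive search is out of reach, which is where heuristic approaches come in.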

stefan.schroedl
0

In this case, two imputation methods can be used:

  • As everyone would try first, fill in the most likely value, i.e., the mean.
  • Predict the missing value from the other attributes, which is called imputation by regression.

I think the second method is better suited to this dataset, where most users rate more than one product.
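
Both imputation methods can be sketched in a few lines. The code below is a minimal illustration with made-up ratings for two hypothetical users. It fills a user's gaps with that user's own mean, or predicts them from one other user with a simple least-squares line fitted on the products both have rated. Using a single predictor and falling back to the mean when fewer than two products overlap are assumptions made for brevity.

```python
def mean_impute(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def regression_impute(target, predictor):
    """Fill gaps in `target` from `predictor` using a least-squares line."""
    pairs = [(x, y) for x, y in zip(predictor, target) if x is not None and y is not None]
    if len(pairs) < 2:
        return mean_impute(target)  # not enough overlap to fit a line
    x_mean = sum(x for x, _ in pairs) / len(pairs)
    y_mean = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    if sxx == 0:
        return mean_impute(target)  # predictor is constant on the overlap
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Keep observed values; predict where possible; leave None if predictor is missing too.
    return [
        y if y is not None else (slope * x + intercept if x is not None else None)
        for x, y in zip(predictor, target)
    ]


# Hypothetical ratings of five products by two users.
user_a = [10, 20, 30, None, 50]
user_b = [12, 22, None, 42, None]
print(mean_impute(user_b))
print(regression_impute(user_b, user_a))
```

Predicted values can fall outside the rating scale, and with as few co-rated products as in the question's sample the fitted line is very unreliable, so this only makes sense once enough overlap exists.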

Also, if you have other datasets about the same users, you can use them as well to predict the missing values in this dataset.

Gökhan Çoban