24

EDIT: Since everyone is getting confused, I want to simplify my question. I have two ordered lists, and I just want to compute how similar one list is to the other.

Eg,

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists, such that order matters? For example, similarity should be penalized because 4 and 5 are swapped in the two lists.

I have two systems: one state-of-the-art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important, as we are talking about a ranked system. Does anyone know of any measures that can help me find the similarity between these two lists?

user1221572
  • 333
  • 1
  • 3
  • 8
  • Are you assuming documents returned by the "state of the art system" are good? Or do you want to test whether your system is better than the "state of the art"? If the second: what is your judge? How do you evaluate whether a query result is indeed relevant? – amit Feb 20 '12 at 17:08
  • @amit: I am assuming docs returned by the state-of-the-art system are good. I want to compute how similar my results are to them, assuming order is very important – user1221572 Feb 20 '12 at 17:14
  • @amit: why did you delete your answer? – user1221572 Feb 20 '12 at 17:22
  • I think it does not fit your need, and I am currently working on an improvement – amit Feb 20 '12 at 17:24
  • I think you should rephrase your question, because everybody is coming to the same conclusion of comparing relevance after reading it, and every time you are asking people to re-read your question, which means there is something wrong with the question. So please elaborate a bit. – Yavar Feb 20 '12 at 17:31

7 Answers

15

DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.

They give the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.

Using DCG/nDCG to evaluate your system against the SOA baseline:

Note: If you set all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art if the documents received the same ranks, as measured by DCG/nDCG.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

To further enhance it, you can give each document a relevance grade [so relevance is not binary], determined by how the document was ranked in the state of the art. For example, rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art system.

If the value received by this evaluation function is close to 1, your system is very similar to the baseline.

Example:

mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]

First, you give a score to each document according to the state-of-the-art system [using the formula above]:

doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222

Now you calculate DCG(stateOfTheArt) using the relevance grades stated above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system using the same relevance weights and get DCG(mySystem) = 1.9784040064803783.

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
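
If you want to reproduce these numbers, here is a minimal Python sketch of the calculation (my own illustration; it assumes base-2 logarithms, i.e. rel_i = 1/log2(1+i) and DCG = sum_i rel_i/log2(i+1), which is what the figures above use):

    import math

    def relevance_grades(state_of_the_art):
        # Grade each document by its rank in the baseline: rel = 1 / log2(1 + rank)
        return {doc: 1.0 / math.log2(1 + rank)
                for rank, doc in enumerate(state_of_the_art, start=1)}

    def dcg(ranking, rel):
        # Gain at position i is discounted by log2(i + 1); unknown documents get 0
        return sum(rel.get(doc, 0.0) / math.log2(i + 1)
                   for i, doc in enumerate(ranking, start=1))

    state_of_the_art = [1, 2, 4, 5, 6, 9]
    my_system = [1, 2, 5, 4, 6, 7]

    rel = relevance_grades(state_of_the_art)
    print(dcg(state_of_the_art, rel))                        # ~2.1100933062283396
    print(dcg(my_system, rel))                               # ~1.9784040064803783
    print(dcg(my_system, rel) / dcg(state_of_the_art, rel))  # ~0.9375907693942939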

amit
  • 175,853
  • 27
  • 231
  • 333
  • @user1221572: Look at my edit, you can use `nDCG(your_system)/nDCG(state_of_the_art_system)` to determine how much the systems are similar. Note: it is important that relevance will not be binary in this evaluation. – amit Feb 20 '12 at 17:29
  • Okay, please give me an example. I have two lists: 1,2,5,4,6,7 (my system) and 1,2,4,5,6,9 (state of the art). What will the measure of similarity be? – user1221572 Feb 20 '12 at 17:33
  • @user1221572: I added an example, have a look. – amit Feb 20 '12 at 17:48
  • I am not sure if using nDCG will be a good idea, because nDCG(state_of_the_art) will always be 1. Thus similarity will just boil down to nDCG(your_system). So should I just use DCG(my_system)/DCG(soa) to calculate similarity? Will this have any drawbacks? – user1221572 Feb 21 '12 at 05:20
  • @user1221572: The metric I suggested is actually a variation of nDCG itself. I don't think there will be any drawbacks using it. – amit Feb 21 '12 at 09:11
  • @amit: One last question before I accept your answer: what value of similarity is good? In other words, what value of similarity means that my system is very similar to the SOA? Do you have a reference paper that can give me this value? – Programmer Feb 25 '12 at 08:07
  • @Programmer: You should run some experiments on a few algorithms to decide which value is good. It should be as close as possible to 1, but I have no idea what "close" should be. – amit Feb 25 '12 at 08:09
  • @amit: Have you read any papers that do something similar to what you have suggested? – Programmer Feb 25 '12 at 16:26
  • @Programmer: Not one that I am aware of; it is just a variation of NDCG I thought of on the fly, but it is extremely similar to the original NDCG – amit Feb 25 '12 at 16:34
  • @amit: So you mean you are just guessing that this should work and it is not a standard? – Programmer Feb 26 '12 at 06:37
  • @amit: Can we carry out a short discussion in chat if you are around now? – Programmer Feb 27 '12 at 13:27
  • @amit: Can't we just use Kendall's Tau to answer my question? – Programmer Feb 27 '12 at 15:10
5

Kendall's tau is the metric you want. It measures the number of pairwise inversions between the lists. Spearman's footrule does something similar, but measures rank distance rather than inversions. They are both designed for the task at hand: measuring the difference between two rank-ordered lists.
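
A quick sketch in Python using SciPy (my addition; it assumes you first restrict both lists to the documents they have in common, since neither measure is defined for items missing from one list):

    from scipy.stats import kendalltau

    a = [1, 7, 4, 5, 8, 9]   # ranking from one system
    b = [1, 7, 5, 4, 9, 6]   # ranking from the other

    # Restrict to documents that appear in both lists and look up their positions.
    common = [doc for doc in a if doc in b]
    pos_a = [a.index(doc) for doc in common]
    pos_b = [b.index(doc) for doc in common]

    tau, p_value = kendalltau(pos_a, pos_b)                    # pairwise-inversion based
    footrule = sum(abs(i - j) for i, j in zip(pos_a, pos_b))   # Spearman's footrule (rank distance)
    print(tau, footrule)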

James
  • 51
  • 1
  • 1
  • The question mentioned "Please note that the order of documents is important as we are talking about a ranked system". Both Kendall's tau and Spearman's footrule don't take the order into account. – M1L0U Mar 14 '13 at 17:39
  • @M1L0U Uh, both of those metrics are specifically designed to take order, or rank, into account. https://en.wikipedia.org/wiki/Rank_correlation They are exactly what OP needs. – ovolve Jul 25 '15 at 20:28
  • Oh yeah, sorry, I meant that they do not weight the error by the true rank of the item. That is, you pay as much for a flip at the top of the ranking as for one at the bottom, unlike DCG or NDCG. – M1L0U Jul 26 '15 at 15:36
2

I actually know four different measures for that purpose.

Three have already been mentioned:

  • NDCG
  • Kendall's Tau
  • Spearman's Rho

But if you have more than two rankings that have to be compared, use Kendall's W.
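
For completeness, here is a rough sketch of Kendall's W for complete rankings without ties (my own code, using the textbook formula W = 12*S / (m^2 * (n^3 - n)); W = 1 means all rankings agree perfectly):

    def kendalls_w(rankings):
        """Kendall's W for m complete rankings of the same n items, no ties."""
        m = len(rankings)
        items = rankings[0]
        n = len(items)
        # R_i: sum over judges of the (1-based) rank they gave item i
        rank_sums = {item: 0 for item in items}
        for ranking in rankings:
            for rank, item in enumerate(ranking, start=1):
                rank_sums[item] += rank
        mean_rank_sum = m * (n + 1) / 2.0
        s = sum((r - mean_rank_sum) ** 2 for r in rank_sums.values())
        return 12.0 * s / (m ** 2 * (n ** 3 - n))

    # Three hypothetical judges ranking the same five documents
    print(kendalls_w([[1, 2, 3, 4, 5],
                      [1, 3, 2, 4, 5],
                      [2, 1, 3, 4, 5]]))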

Carsten
  • 180
  • 1
  • 9
2

In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as above-mentioned Kendall Tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable for varying length of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.

The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
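
If you just want to experiment before reading the paper, a rough sketch of the truncated form of RBO at evaluation depth k might look like this (my own code; the paper's extrapolated version adds a correction for the unseen tail):

    def rbo_min(list1, list2, p=0.9, depth=None):
        """Truncated Rank-Biased Overlap: (1 - p) * sum_{d=1..k} p^(d-1) * overlap@d / d."""
        k = depth or min(len(list1), len(list2))
        seen1, seen2 = set(), set()
        score = 0.0
        for d in range(1, k + 1):
            seen1.add(list1[d - 1])
            seen2.add(list2[d - 1])
            overlap = len(seen1 & seen2) / d   # agreement of the two prefixes of length d
            score += (p ** (d - 1)) * overlap
        return (1 - p) * score

    print(rbo_min([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6], p=0.9))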

Another simple measure is cosine similarity: the cosine between two vectors whose dimensions correspond to items, with inverse ranks as weights. However, it doesn't gracefully handle items that only occur in one of the lists (see the implementation in the link above). It works as follows:

  1. For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
  2. Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
  3. Return v12 / sqrt(v11 * v22)

For your example, this gives a value of 0.7252747.
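
A direct Python transcription of steps 1-3 (my sketch; the exact number you get may differ from the figure quoted above depending on which pair of example lists you plug in):

    import math

    def inverse_rank_weights(ranking, universe):
        # h(i) = 1/rank for items in the list, 0 for items only present in the other list
        return {item: (1.0 / (ranking.index(item) + 1) if item in ranking else 0.0)
                for item in universe}

    def cosine_similarity(list1, list2):
        universe = set(list1) | set(list2)
        h1 = inverse_rank_weights(list1, universe)
        h2 = inverse_rank_weights(list2, universe)
        v12 = sum(h1[i] * h2[i] for i in universe)
        v11 = sum(h1[i] ** 2 for i in universe)
        v22 = sum(h2[i] ** 2 for i in universe)
        return v12 / math.sqrt(v11 * v22)

    print(cosine_similarity([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6]))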

Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) rather than similarity; a new ranking will sometimes be better and sometimes worse than the baseline, and you want to know whether the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does this mean it is really similar or significantly different? Similarity measures are more helpful for saying that, e.g., a new ranking method 1 is closer to production than another new ranking method 2.

stefan.schroedl
  • 866
  • 9
  • 19
2

Is the list of documents exhaustive? That is, is every document rank-ordered by system 1 also rank-ordered by system 2? If so, Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
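
As a sketch, SciPy will compute this directly if you treat the two lists as paired observations (which is presumably how the rho = 0.943 mentioned in the comments below was obtained):

    from scipy.stats import spearmanr

    my_system = [1, 2, 5, 4, 6, 7]
    state_of_the_art = [1, 2, 4, 5, 6, 9]

    # Correlate the ranks of the paired values in the two lists.
    rho, p_value = spearmanr(my_system, state_of_the_art)
    print(rho)   # ~0.943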

russellpierce
  • 4,583
  • 2
  • 32
  • 44
  • Per the example the OP gave in the comment to amit, the method I mentioned (much more statistical than comp-sci) gives rho = 0.943. – russellpierce Feb 20 '12 at 17:40
  • As you can see, the lists are not exhaustive. Does your method still work? – user1221572 Feb 20 '12 at 17:44
  • 1
    It still works... rho uses pairs of order and tells you about the relationship between those rank orders. – russellpierce Feb 20 '12 at 17:46
  • 1
    However, the broader question I posted in my answer holds. How should one interpret the presence of a document in the state of the art system but its absence in your system? Is it the same as saying your system ranked it below some minimum threshold? If so, the value for any comparison metric is being inflated by only considering cases where the rank orders are in a similar range and ignoring those where there is a big disagreement between the systems. – russellpierce Feb 20 '12 at 17:48
  • 1
    Conceptually, the amount of agreement for a list like the one you are talking about is (I think, most simply) quantitatively thought of as some combination of the absolute differences in rank scores between the two lists. Everything else that happens to the numbers is fancy nonsense of one kind or another. – russellpierce Feb 20 '12 at 17:53
  • I *think* that according to traditional statistical comparisons such as Spearman's, `([1,2,3],[1,2,4])` and `([4,1,2],[3,1,2])` will both yield the same result, but it is more important to be correct in the first elements. Or did I misunderstand this method? :\ – amit Feb 20 '12 at 18:04
  • Amit, you are correct. The example you provide is a little perplexing because the pairs are the same in both examples (1 and 1, 2 and 2, and 3 and 4)... but no matter. The key point you make is sound (and I was about to mention it on your question): Spearman does not make the assumption that agreement in early ranks is more important than in later ranks. That is a useful sort of "fancy nonsense" that requires assumptions perhaps implicit, but certainly not explicit, in the OP's question. – russellpierce Feb 20 '12 at 18:11
  • 1
    Notably, it is that assumption, that disagreement matters most in the early ranks, that makes partially ignoring the issue of missing documents possible. Specifically, missing documents are considered under a DCG approach to be just 'lost signal', whereas in a Spearman's rho approach, including those results would drastically change the value of the statistic. Could a similar problem occur in DCG? Considering your example... what would happen to the values if you suddenly knew something about the 3rd-ranked document? – russellpierce Feb 20 '12 at 18:16
2

As you said, you want to compute how similar one list is to the other. Simplistically, I think you can start by counting the number of inversions. There is an O(N log N) divide-and-conquer approach to this. It is a very simple way to measure the "similarity" between two lists.

E.g., if you want to compare how 'similar' the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions between them. The lower the count, the more 'similar' their tastes are.

Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the 'similarity' of your ranking. Of course this is just a starter approach, but you can build on it, for example by deciding how strictly you want to penalize each "inversion gap".

    D1 D2 D3 D4 D5 D6
    -----------------
R1: 1, 7, 4, 5, 8, 9  [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6  [ your Rankings]

Since the rankings are in document order, you can write your own comparator function based on R1 (the ranking of the "state of the art" system) and then count the inversions in R2 relative to that comparator.

You can "penalize" 'similarity' for each inversions found: i < j but R2[i] >' R2[j]
( >' here you use your own comparator)
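
A sketch of that idea in Python (my own illustration): map each document in your ranking to its position in the baseline ranking, then count inversions in the resulting sequence with a merge-sort style O(N log N) pass.

    def count_inversions(seq):
        """Return (sorted seq, number of pairs i < j with seq[i] > seq[j]), via merge sort."""
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, inv_left = count_inversions(seq[:mid])
        right, inv_right = count_inversions(seq[mid:])
        merged, i, j = [], 0, 0
        inversions = inv_left + inv_right
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] jumps ahead of every remaining left element
                inversions += len(left) - i
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    r1 = [1, 7, 4, 5, 8, 9]   # rankings from the 'state of the art' system
    r2 = [1, 7, 5, 4, 9, 6]   # your rankings

    # Express R2 in terms of positions in R1; documents unknown to R1 are pushed to the end here.
    baseline_pos = {doc: i for i, doc in enumerate(r1)}
    mapped = [baseline_pos.get(doc, len(r1)) for doc in r2]
    _, inversions = count_inversions(mapped)
    print(inversions)   # 1 inversion: documents 4 and 5 are swapped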

Links you may find useful:
Link1
Link2
Link3

srbhkmr
  • 2,074
  • 1
  • 14
  • 19
1

I suppose you are talking about comparing two Information Retrieval systems, which, trust me, is not trivial. It is a complex computer science problem.

For measuring relevance or doing a kind of A/B testing, you need a couple of things:

  1. A competitor to measure relevance against. As you have two systems, this prerequisite is met.

  2. You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and for the holes (i.e. query/url pairs that were not rated) you can have a dynamic ranking function by using a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, but it's true (please read below for an example from Google/Bing).

Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So for each query, generally the top 3 or top 5 query/url pairs are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.

According to wikipedia:

Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.

Wikipedia explains NDCG well; it is a short article, please go through it.

Yavar
  • 11,883
  • 5
  • 32
  • 63
  • I am not trying to compare which system is better. I am just trying to prove that my results are similar to the state-of-the-art system. How does NDCG help me here? – user1221572 Feb 20 '12 at 17:22