
I have an ordered collection:

[Doc1, Doc2, Doc3, Doc4, Doc5] 

where Doc1 is ranked ahead of Doc2 (imagine a search query situation, where this ordered collection is the result of the search).

Now, say I have a second ordered collection:

[Doc1, Doc2, Doc3, Doc5, Doc4]

I need a way to quantify this difference as a numerical score. It must also take the position of the difference into account, so that [Doc1, Doc2, Doc3, Doc5, Doc4] is closer to the original collection than [Doc2, Doc1, Doc3, Doc4, Doc5] is, because the latter's difference occurs closer to the top.

I have considered the Levenshtein distance, but couldn't see how to take the position of a difference into account.

Dominic Bou-Samra
  • On what basis is this closeness in ordering identified? – Sachin Shanbhag Oct 20 '12 at 07:58
  • Not quite sure what you mean sorry. – Dominic Bou-Samra Oct 20 '12 at 08:00
  • You are probably looking for something like [DCG](http://en.wikipedia.org/wiki/Discounted_Cumulative_Gain) or [nDCG](http://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG). I elaborated on it in [this thread](http://stackoverflow.com/questions/9365184/computing-similarity-between-two-lists), which is basically an identical question I believe. – amit Oct 20 '12 at 08:05

1 Answer


According to Wikipedia, the Levenshtein distance can be calculated with the following piece of pseudocode.

int LevenshteinDistance(string s, string t)
{
  int len_s = length(s), len_t = length(t);

  // base cases: one of the strings is empty
  if (len_s == 0)
    return len_t;
  if (len_t == 0)
    return len_s;

  // cost of substituting the first characters
  int cost = (s[0] != t[0]) ? 1 : 0;

  return minimum(
      LevenshteinDistance(s[1..len_s], t) + 1,               // deletion
      LevenshteinDistance(s, t[1..len_t]) + 1,               // insertion
      LevenshteinDistance(s[1..len_s], t[1..len_t]) + cost); // substitution
}
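
For reference, a plain Python port of that pseudocode might look like the sketch below (the function name and the use of slicing are choices made here, not part of the answer). It works on strings as well as on lists of documents.

def levenshtein(s, t):
    """Plain recursive Levenshtein distance between two sequences."""
    if len(s) == 0:           # s is exhausted: insert the rest of t
        return len(t)
    if len(t) == 0:           # t is exhausted: delete the rest of s
        return len(s)
    cost = 0 if s[0] == t[0] else 1
    return min(
        levenshtein(s[1:], t) + 1,         # deletion
        levenshtein(s, t[1:]) + 1,         # insertion
        levenshtein(s[1:], t[1:]) + cost,  # substitution (or match)
    )

print(levenshtein("1234", "1243"))                      # 2
print(levenshtein(["Doc1", "Doc2"], ["Doc2", "Doc1"]))  # 2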

If I understand your requirement correctly, you want differences at the beginning of the collection to be more significant than differences towards the end. Let's adapt the recursive function to reflect this.

float LevenshteinDistance(string s, string t, float decay)
{
  int len_s = length(s), len_t = length(t);

  // base cases: one of the strings is empty
  if (len_s == 0)
    return len_t;
  if (len_t == 0)
    return len_s;

  // cost of substituting the first characters
  int cost = (s[0] != t[0]) ? 1 : 0;

  // every recursion level is discounted by another factor of decay,
  // so edits near the front of the strings contribute more
  return decay * minimum(
      LevenshteinDistance(s[1..len_s], t, decay) + 1,               // deletion
      LevenshteinDistance(s, t[1..len_t], decay) + 1,               // insertion
      LevenshteinDistance(s[1..len_s], t[1..len_t], decay) + cost); // substitution
}

When decay is a parameter in the interval (0, 1), differences at larger indices become less significant than differences at earlier ones.
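
A runnable Python sketch of this weighted variant (again, the name is chosen here rather than taken from the answer) could look like this:

def weighted_levenshtein(s, t, decay=0.9):
    """Levenshtein distance in which edits near the front count more.

    Each recursion level multiplies the remaining cost by decay, so an
    edit at index i is scaled by roughly decay ** (i + 1).
    """
    if len(s) == 0:
        return float(len(t))
    if len(t) == 0:
        return float(len(s))
    cost = 0 if s[0] == t[0] else 1
    return decay * min(
        weighted_levenshtein(s[1:], t, decay) + 1,         # deletion
        weighted_levenshtein(s, t[1:], decay) + 1,         # insertion
        weighted_levenshtein(s[1:], t[1:], decay) + cost,  # substitution
    )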

Here's an example for decay=0.9.

s       t       dist
"1234"  "1234"  0.0000
"1234"  "1243"  1.3851
"1234"  "2134"  1.6290
Jan