Is this a correct implementation of Kendall tau distance in Scala?

def distance[A : Ordering](s: Seq[A], t: Seq[A]): Int = {
  assert(s.size == t.size, "Both sequences should be of the same length")

  s.combinations(2).zip(t.combinations(2)).count { 
    case (Seq(s1, s2), Seq(t1, t2)) =>
      (s1 > s2 && t1 < t2) || (s1 < s2 && t1 > t2)
  }
}

The problem is that I do not have enough data to test the algorithm on, only a few examples from Wikipedia, and I do not understand the algorithm well enough to generate my own test data. Most sources are about the Kendall tau rank correlation coefficient, which is a related but different animal. Maybe I could somehow derive one from the other?

For now let's say that performance is not important.

UPDATE

So, now I have three implementations of the Kendall tau distance algorithm. Two of them (distance1 and distance3) give identical results (see below). So, which one is correct?

import scala.math.Ordering.Implicits._
import scala.util.Random

val permutations = Random.shuffle((0 until 5).permutations).take(100)

println("s\tt\tDist1\tDist2\tDist3")
permutations.sliding(2).foreach { case Seq(s, t) =>
  println(s.mkString(",")+"\t"+t.mkString(",")+"\t"+distance1(s, t)+"\t"+distance2(s, t)+
    "\t"+distance3(s, t))
}

def distance1[A : Ordering](s: Seq[A], t: Seq[A]): Int = {
  assert(s.size == t.size, "Both sequences should be of the same length")

  s.combinations(2).zip(t.combinations(2)).count { case (Seq(s1, s2), Seq(t1, t2)) =>
    (s1 > s2 && t1 < t2) || (s1 < s2 && t1 > t2)
  }
}

def distance2[A](a: Seq[A], b: Seq[A]): Int = {
  val aMap = a.zipWithIndex.toMap // map of a items to their ranks
  val bMap = b.zipWithIndex.toMap // map of b items to their ranks

  a.combinations(2).count{case Seq(i, j) =>
    val a1 = aMap.get(i).get // rank of i in A
    val a2 = aMap.get(j).get // rank of j in A
    val b1 = bMap.get(i).get // rank of i in B
    val b2 = bMap.get(j).get // rank of j in B
    a1.compare(a2) != b1.compare(b2)
  }
}

def distance3(τ_1: Seq[Int], τ_2: Seq[Int]) =
  (0 until τ_1.size).map { i =>
    (i+1 until τ_2.size).count { j =>
      (τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))
    }
  }.sum

And here are some results:

s   t   Dist1   Dist2   Dist3
3,0,4,2,1   1,4,3,0,2   6   6   6
1,4,3,0,2   0,4,1,2,3   3   5   3
0,4,1,2,3   4,0,1,3,2   8   2   8
4,0,1,3,2   1,2,0,4,3   4   6   4
1,2,0,4,3   2,3,1,4,0   3   5   3
2,3,1,4,0   1,0,3,2,4   8   6   8
1,0,3,2,4   1,3,2,4,0   7   3   7
1,3,2,4,0   4,3,0,1,2   6   6   6
4,3,0,1,2   1,0,2,4,3   7   7   7
1,0,2,4,3   3,4,1,2,0   8   8   8
3,4,1,2,0   1,4,2,0,3   5   5   5
1,4,2,0,3   1,0,3,4,2   8   4   8
Vilius Normantas
  • What do you think the values of the sequence represent? I ask because distance2 gets the ranks from the position of the values in their sequence, but distance3 treats the values as ranks themselves, which is why the results are different (in other words, different inputs lead to different outputs). – Will Fitzgerald Jul 09 '14 at 15:07
  • Some random permutation of rankings. Am I wrong? – Vilius Normantas Jul 09 '14 at 15:12
  • Now I'm completely confused. Let me rephrase the question. How do I calculate the Kendall tau distance between the pairs of permutations, as in the example above? – Vilius Normantas Jul 09 '14 at 15:16
  • The distance2 does *not* assume they are ranks, but items, so that is why it is different from distance3. Input to distance2, for example, could be tauDistance(List('a','b','c'), List('c','b','a')), and it uses the positions of the items in their sequences as their ranks. – Will Fitzgerald Jul 09 '14 at 15:35

2 Answers

1

I don't think this is quite right. Here's some quickly written code that emphasizes that what you are comparing is the rank of the items in the sequences (you don't really want to keep those get(n).get calls in your code though). I used compare, too, which I think makes sense:

def tauDistance[A](a: Seq[A], b: Seq[A]) = {
  val aMap = a.zipWithIndex.toMap // map of a items to their ranks
  val bMap = b.zipWithIndex.toMap // map of b items to their ranks
  a.combinations(2).count{case Seq(i, j) =>
    val a1 = aMap.get(i).get // rank of i in A
    val a2 = aMap.get(j).get // rank of j in A
    val b1 = bMap.get(i).get // rank of i in B
    val b2 = bMap.get(j).get // rank of j in B
    a1.compare(a2) != b1.compare(b2)
  }
}
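
For illustration, here is a minimal sketch of the same idea with the get(n).get calls replaced by direct map lookups. The object name ItemView and the require check are assumptions added for this example. Like the function above, it assumes the items within each sequence are unique and that both sequences contain the same items.

```scala
object ItemView {
  // Kendall tau distance between two orderings of the same unique items.
  // An item's rank is simply its position in the sequence.
  def tauDistance[A](a: Seq[A], b: Seq[A]): Int = {
    require(a.size == b.size, "Both sequences should be of the same length")
    val rankA = a.zipWithIndex.toMap // item -> position in a
    val rankB = b.zipWithIndex.toMap // item -> position in b

    // A pair of items is discordant when the two orderings disagree
    // about which of the two items comes first.
    a.combinations(2).count { case Seq(i, j) =>
      rankA(i).compare(rankA(j)) != rankB(i).compare(rankB(j))
    }
  }
}
```

Called as ItemView.tauDistance(List('a', 'b', 'c'), List('c', 'b', 'a')), every one of the three pairs is ordered differently in the two lists, so all three count as discordant.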
Will Fitzgerald
  • I hacked a quick test with some random permutations to see how the algorithms compare, and yes in many cases results differ. Now, pardon my arrogance, but how do I know which one is correct? Where could I find some calculator or examples to verify? – Vilius Normantas Jul 09 '14 at 05:29
  • Yes, they will differ, because I believe what your code is doing is incorrect. The second `zip` is the cause; it is unnecessary and semantically wrong. Remember, you need to compare the *ranks* of the items in the sequences, not the items themselves. I'll put another example below, which implements the first Wikipedia definition rather directly. – Will Fitzgerald Jul 09 '14 at 12:27
  • I finally concluded that this implementation is the correct one. I found another implementation (in Java: http://algs4.cs.princeton.edu/22mergesort/Inversions.java.html), which gives the same results (as long as permutations are 0 to n-1). – Vilius Normantas Jul 10 '14 at 06:07
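
The inversion-counting approach mentioned in the last comment can be sketched in Scala as follows. This is not a port of the linked Java code, just one way to apply the same idea: replace each item of a by its position in b, then count inversions in the resulting sequence with a merge sort. The names Inversions, sortAndCount and distance are made up for this example, and the items are assumed to be unique and shared by both sequences.

```scala
object Inversions {
  // Returns the sorted vector together with the number of inversions,
  // i.e. pairs (i, j) with i < j and xs(i) > xs(j).
  def sortAndCount(xs: Vector[Int]): (Vector[Int], Long) =
    if (xs.size <= 1) (xs, 0L)
    else {
      val (left, right) = xs.splitAt(xs.size / 2)
      val (ls, leftCount) = sortAndCount(left)
      val (rs, rightCount) = sortAndCount(right)

      val merged = Vector.newBuilder[Int]
      var i = 0
      var j = 0
      var cross = 0L
      while (i < ls.size && j < rs.size) {
        if (ls(i) <= rs(j)) { merged += ls(i); i += 1 }
        else {
          // rs(j) is smaller than every remaining element of ls,
          // so it forms an inversion with each of them.
          merged += rs(j); j += 1
          cross += ls.size - i
        }
      }
      merged ++= ls.drop(i)
      merged ++= rs.drop(j)
      (merged.result(), leftCount + rightCount + cross)
    }

  // Kendall tau distance between two orderings of the same unique items.
  def distance[A](a: Seq[A], b: Seq[A]): Long = {
    val posInB = b.zipWithIndex.toMap // item -> position in b
    sortAndCount(a.map(posInB).toVector)._2
  }
}
```

The merge step does the counting in O(n log n) instead of visiting every pair, which only matters once performance becomes a concern.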
1

So, Wikipedia defines K on the ranks of the elements like this:

K(τ_1,τ_2) = |{(i,j): i < j, (τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))}|

We can implement this pretty directly in Scala, remembering that the inputs are sequences of ranks, not the items themselves:

def K(τ_1: Seq[Int], τ_2: Seq[Int]) = 
  (0 until τ_1.size).map{i => 
    (i+1 until τ_2.size).count{j => 
      (τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))
    }
  }.sum

This is actually a bit preferable to the tauDistance function above, since that function assumes all the items are unique (and so will fail if the sequences have duplicates) while this one works on the ranks directly.
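
To make the rank vs. item distinction concrete, here is a small sketch, assuming the inputs are orderings of the items 0 to n-1 (as in the random permutations from the question). The helper toRanks and the object name RankView are hypothetical names introduced for this example. toRanks turns an ordering of items into a rank vector, where index i holds the position of item i. Feeding those rank vectors to K appears to reproduce what the item-based function computes on the raw orderings.

```scala
object RankView {
  // K as defined above: r1(i) and r2(i) are the ranks of item i.
  def k(r1: Seq[Int], r2: Seq[Int]): Int =
    (0 until r1.size).map { i =>
      (i + 1 until r2.size).count { j =>
        (r1(i) < r1(j) && r2(i) > r2(j)) || (r1(i) > r1(j) && r2(i) < r2(j))
      }
    }.sum

  // Assumes `order` is a permutation of 0 until n.
  // Result: index i holds the position (rank) of item i in `order`.
  def toRanks(order: Seq[Int]): Seq[Int] = {
    val position = order.zipWithIndex.toMap // item -> position
    order.indices.map(position)
  }
}
```

For the pair 1,4,3,0,2 and 0,4,1,2,3 from the question's table, RankView.k on the raw sequences gives 3 (the Dist3 column), while RankView.k on toRanks of each sequence gives 5 (the Dist2 column). Under this reading the two functions answer different questions rather than one of them being wrong.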

Working with combinatoric functions is hard sometimes, and it's often not enough just to have unit tests that pass.
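
One possible cross-check that does not depend on hand-made examples uses the link to the rank correlation coefficient raised in the question. When there are no ties, the coefficient is tau = (C - D) / (n(n-1)/2), where C and D are the numbers of concordant and discordant pairs, and K is exactly D, so tau = 1 - 4K / (n(n-1)). The sketch below computes tau separately from sign products and checks the identity over all pairs of permutations of 0 until n. The names CrossCheck, tau and agrees are assumptions made for this example, and n is assumed to be at least 2.

```scala
object CrossCheck {
  // K on rank vectors, same definition as above.
  def k(r1: Seq[Int], r2: Seq[Int]): Int =
    (0 until r1.size).map { i =>
      (i + 1 until r2.size).count { j =>
        (r1(i) < r1(j) && r2(i) > r2(j)) || (r1(i) > r1(j) && r2(i) < r2(j))
      }
    }.sum

  // Kendall tau coefficient without ties: each pair contributes +1 when
  // both rankings order it the same way and -1 when they disagree.
  def tau(r1: Seq[Int], r2: Seq[Int]): Double = {
    val n = r1.size
    val s = (for (i <- 0 until n; j <- i + 1 until n)
      yield math.signum(r1(i) - r1(j)) * math.signum(r2(i) - r2(j))).sum
    s.toDouble / (n * (n - 1) / 2)
  }

  // True when tau == 1 - 4K / (n(n-1)) for every pair of permutations.
  def agrees(n: Int): Boolean = {
    val perms = (0 until n).permutations.toVector
    perms.forall { p =>
      perms.forall { q =>
        math.abs(tau(p, q) - (1.0 - 4.0 * k(p, q) / (n * (n - 1)))) < 1e-9
      }
    }
  }
}
```

This only confirms that the distance and the coefficient are consistent with each other under the rank interpretation; it does not settle whether a given input should be read as ranks or as items.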

Will Fitzgerald
  • I appreciate the time you spent trying to help me with this problem. Yet, we seem to have a problem here. I ran the same tests again, and it seems that the results of this last implementation match my initial implementation, but not the one you suggested in the previous post. I've updated my question to show exactly what I did. – Vilius Normantas Jul 09 '14 at 14:00
  • You're more than welcome, and I think the basic issue is the rank vs. item confusion (on my part, perhaps). – Will Fitzgerald Jul 09 '14 at 15:40