Is there an implementation in R that sorts a permutation of n numbers back into the original sequence 1, ..., n and reports the number of reversals needed? For example, an implementation of "sorting by reversals" or "sorting by translocations" as outlined in this ppt.

Specifically, I have a permutation pi of a sequence of n elements, and I want to measure how close it is to the original sequence. The number of reversals seems like a good metric.

Thanks!

Rguy
user1357015
  • this doesn't quite answer your question, but Kendall's tau http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient gives a statistic that would seem to be very appropriate in this context (scaled value of [# concordant pairs] - [# discordant pairs]); `cor.test(...,method="kendall")` returns the value for specified x and y values (i.e. `cor.test(seq(n),pi(n),method="kendall")$statistic`) – Ben Bolker Oct 14 '12 at 22:59

1 Answer

This seems like a job for Kendall's distance (sometimes also called the bubble-sort distance). It is probably the most commonly used distance metric on rankings.

The Kendall distance counts the number of pairs of items that the two sequences place in opposite order. When one of the sequences is the trivial sequence (1, 2, ..., n), the distance is simply the number of pairs of positions i < j with pi(i) > pi(j), i.e. the number of inversions in pi.
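
To make the definition concrete, here is a minimal base-R sketch that counts those inversions directly. The function names `kendall_distance` and `tau_from_distance` are made up for illustration and are not part of any package. The code assumes the permutation has no ties.

```r
# Kendall distance to 1:n: the number of pairs i < j with p[i] > p[j].
# O(n^2), which is fine for short sequences.
kendall_distance <- function(p) {
  n <- length(p)
  if (n < 2) return(0L)
  inv <- 0L
  for (i in seq_len(n - 1)) {
    # count the later entries that are smaller than p[i]
    inv <- inv + sum(p[i] > p[(i + 1):n])
  }
  inv
}

# With no ties, Kendall's tau against 1:n is a rescaling of the distance:
# tau = 1 - 4 * d / (n * (n - 1)).  Only meaningful for n >= 2.
tau_from_distance <- function(p) {
  n <- length(p)
  1 - 4 * kendall_distance(p) / (n * (n - 1))
}

kendall_distance(c(2, 1, 3, 4, 5))  # 1
kendall_distance(c(5, 1, 2, 3, 4))  # 4
kendall_distance(5:1)               # 10
```

The second helper shows why the distance and Kendall's tau carry the same information: for a tie-free permutation, `tau_from_distance(p)` should agree with `cor(seq_along(p), p, method = "kendall")`.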

If you like this metric (it equals the minimum number of swaps of adjacent items needed to transform the sequence into 1:n), you can find it in my package, RMallow, available on CRAN. The function is called AllSeqDists. Here is an example:

library(RMallow)
# Create a matrix of six sequences, one per row, each of length 5.
# The last two rows contain a gap (they skip the value 3).
datas <- matrix(c(1:5,
                  5:1,
                  c(2, 1, 3, 4, 5),
                  c(5, 1, 2, 3, 4),
                  c(1, 2, 4, 5, 6),
                  c(1, 5, 6, 2, 4)),
                nrow = 6, byrow = TRUE)
# Simplify the sequences first so that the rows with gaps can be handled
datas <- SimplifySequences(datas)
# Calculate all of their Kendall distances to the sequence (1, 2, 3, 4, 5)
dists <- AllSeqDists(datas)
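
If you would rather not depend on a package, the "bubble-sort" interpretation can be sketched in base R. This is an illustration only, not how RMallow computes it: the name `bubble_sort_swaps` is hypothetical, and relabelling with `rank()` is my assumption for handling gaps such as (4, 2, 1, 5, 6). It assumes no ties.

```r
# Count the adjacent swaps that bubble sort needs to put p in increasing order.
# For a tie-free sequence this equals the number of inversions, i.e. the
# Kendall distance to the sorted sequence.
bubble_sort_swaps <- function(p) {
  p <- rank(p)          # relabel onto 1..n so gaps do not matter (assumes no ties)
  n <- length(p)
  swaps <- 0L
  repeat {
    swapped <- FALSE
    for (i in seq_len(max(n - 1, 0))) {
      if (p[i] > p[i + 1]) {
        p[c(i, i + 1)] <- p[c(i + 1, i)]  # swap the adjacent pair
        swaps <- swaps + 1L
        swapped <- TRUE
      }
    }
    if (!swapped) break   # a full pass without swaps means p is sorted
  }
  swaps
}

bubble_sort_swaps(c(5, 1, 2, 3, 4))  # 4
bubble_sort_swaps(c(4, 2, 1, 5, 6))  # 3, same as for c(3, 2, 1, 4, 5)
```

Because only the relative order matters, relabelling a gapped sequence by rank leaves the swap count unchanged.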

You might also consider Spearman's metric.
Depending on what you want to do, I must also plug a class of models for ranking data called "Mallows models".
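
For readers who have not met them: in its simplest form, a Mallows model gives a permutation a probability that decays exponentially with its Kendall distance from a central ranking. The sketch below writes that down in base R for the central ranking 1:n. The names `mallows_psi` and `mallows_prob` and the parameter `theta` are illustrative, not RMallow's API, and the code assumes theta > 0 and a tie-free permutation of 1:n.

```r
# Kendall distance to 1:n (number of inversions in p).
kendall_distance <- function(p) {
  n <- length(p)
  if (n < 2) return(0L)
  sum(vapply(seq_len(n - 1),
             function(i) sum(p[i] > p[(i + 1):n]),
             integer(1)))
}

# Normalising constant: the sum of exp(-theta * d) over all n! permutations,
# which has the closed form prod_{j=1..n} (1 - exp(-j*theta)) / (1 - exp(-theta)).
mallows_psi <- function(n, theta) {
  j <- seq_len(n)
  prod((1 - exp(-j * theta)) / (1 - exp(-theta)))
}

# Probability of permutation p under a Mallows model centred at 1:n.
mallows_prob <- function(p, theta) {
  exp(-theta * kendall_distance(p)) / mallows_psi(length(p), theta)
}

# The identity permutation is the most likely one; the reversed one the least.
mallows_prob(1:4, theta = 1) > mallows_prob(4:1, theta = 1)  # TRUE
```

Larger values of `theta` concentrate the probability more tightly around the central ranking.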

Rguy
  • Hi, this sounds pretty good! Would AllSeqDists work if there was a gap in the sequence (say 4,2,1,5,6)? – user1357015 Oct 15 '12 at 15:43
  • Also, is this any different than the Damerau-Levenshtein distance with only transpositions allowed? – user1357015 Oct 15 '12 at 15:49
  • @user1357015 I updated my answer to add functionality that addresses your first question. As for your second question: evidently it is identical, and you have found yet another name for this metric! – Rguy Oct 15 '12 at 17:58
  • and now that you have the name, you can try `install.packages("sos"); library("sos"); findFn("levenshtein distance")` ... – Ben Bolker Oct 15 '12 at 19:51