Algorithm to sort rows and cols by similarity

Question

I came across a spreadsheet that explains a method for sorting both the rows and the columns of a matrix of binary data so that the number of changes between consecutive rows and columns is minimized.

For example, starting with:

After 15 manual steps described in the tabs of the spreadsheet, the following table is obtained:

I would like to know:

What is the common name of this algorithm or method?
How can it be applied to larger tables (where 2^n would overflow)?
How can it be generalized to non-binary data, for example using Levenshtein distance?
Is there any existing code (Excel VBA, Python, ...) that already implements this? Otherwise I'll write it.

Thanks !

This is euclidean hamiltonian path in {0,1}^n; I think there might be constant-factor approximation algorithms since hampath is closely related to TSP (both hampath and TSP are np-hard for general graphs), and we have approximation algorithms for TSP, but don't expect to solve it optimally - although I'm not entirely sure that a hardness proof for this specific space exists, I'd be surprised if this was in P. I don't know what VBA can do, so I can't tell you whether you can implement an approximation algorithm there. — G. Bach, Apr 11 '16 at 09:11
Having a second look, the distance is actually not euclidean, but the Hamming distance; I don't know hardness proofs or approximation algorithms for that one, but they probably exist. — G. Bach, Apr 11 '16 at 09:19
Related: [Gray codes](https://en.wikipedia.org/wiki/Gray_code), also available as n-ary variants. — Norman, Apr 11 '16 at 10:46
I was actually wrong; this isn't hampath, it's TSP without returning to the source, which is the TSP path problem (possibly referred to as TSPP in the literature). I also could've added that for non-binary coordinates, your problem is exactly euclidean TSPP; see also [this](http://stackoverflow.com/questions/20501634/traveling-salesman-without-return-and-with-given-start-and-end-cities). — G. Bach, Apr 11 '16 at 12:44
I have to say, strictly speaking, this "question" is probably a poor fit for the Stack Overflow format. I mean, it's got multiple subquestions (which is discouraged), those component questions are arguably not *programming* questions (so off-topic), or are too broad (answering the question properly would require writing way too much). Despite all that, it's an *interesting* question, and it's cool to see that spreadsheet! — John Y, Apr 13 '16 at 21:29
To answer subquestion 2: If you use Python to extract the data and then work purely in Python, you won't have overflow because Python has arbitrary-precision integers built-in. — John Y, Apr 13 '16 at 21:33
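As a rough illustration of the last comment, here is a minimal sketch of packing a row of bits into a single Python integer. It assumes the overflow in the spreadsheet comes from encoding each row as a number of size up to 2^n; the helper names `row_to_int` and `hamming_int` are made up for this example.

```python
def row_to_int(row):
    """Pack a sequence of 0/1 values into one integer, first element = highest bit."""
    value = 0
    for bit in row:
        value = (value << 1) | (1 if bit else 0)
    return value


def hamming_int(a, b):
    """Count differing bits between two packed rows via XOR."""
    return bin(a ^ b).count("1")


# A 200-column row is far beyond 64 bits, yet Python handles it without overflow.
wide_row = [1] * 200
packed = row_to_int(wide_row)
print(packed.bit_length())  # number of bits needed to store the packed row
```

Working on lists of bits directly avoids the encoding step altogether; the packed form is only useful if the spreadsheet's numeric representation has to be reproduced.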

score 3 · Answer 1 · answered Aug 27 '16 at 05:46

You can represent each row by a vector L = [1, 1, 0, ... 1], and then define the distance d(L0, L1) between two rows as the number of positions at which L0 and L1 differ. This is known as the Hamming distance. For non-binary data you would simply swap in a different distance, and yes, Levenshtein distance would be one option.
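A minimal sketch of that distance in Python, assuming each row is a list of equal length. The function names `hamming` and `total_changes` are chosen for this example; `total_changes` is the quantity the ordering tries to minimize.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length")
    return sum(x != y for x, y in zip(a, b))


def total_changes(rows):
    """Sum of the distances between each pair of consecutive rows."""
    return sum(hamming(r, s) for r, s in zip(rows, rows[1:]))


print(hamming([1, 1, 0, 1], [1, 0, 0, 0]))
print(total_changes([[0, 0], [1, 1], [0, 1]]))
```

Because `hamming` only compares elements for equality, it also works on rows of non-binary values; replacing it with another distance function changes nothing else in the approach.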

Once the distance is well-defined, the rest of your problem is minimizing the total distance between consecutive rows. This is essentially the Traveling Salesman Problem (more precisely its path variant, since there is no need to return to the first row), which is known to be NP-hard (http://www.diku.dk/hjemmesider/ansatte/jyrki/Paper/EKP85.pdf).

The direct solution (trying every permutation) is O(n!), but you can do better with dynamic programming, for example the Held–Karp algorithm, which runs in O(n^2 2^n) time and is therefore still only practical for small n. There are also heuristics, such as the nearest neighbour algorithm, which quickly compute a solution that is generally not optimal.
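The nearest neighbour heuristic mentioned above could be sketched as follows. This is an illustration under assumptions, not the method used in the spreadsheet: the function names are invented, the start row is arbitrarily the first one, and ties are broken by lowest index. Permuting the rows does not change the Hamming distance between any two columns, so the sketch orders rows and columns independently.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_order(rows, start=0):
    """Greedy ordering: repeatedly move to the closest not-yet-visited row."""
    if not rows:
        return []
    order = [start]
    remaining = set(range(len(rows))) - {start}
    while remaining:
        last = rows[order[-1]]
        # Ties are broken by the lowest index so the result is deterministic.
        nxt = min(remaining, key=lambda i: (hamming(last, rows[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_cost(rows, order):
    """Total distance between consecutive rows when visited in `order`."""
    return sum(hamming(rows[i], rows[j]) for i, j in zip(order, order[1:]))


def reorder_rows_and_cols(matrix):
    """Apply the heuristic to the rows, then to the columns (via transpose)."""
    row_order = nearest_neighbour_order(matrix)
    cols = [list(c) for c in zip(*matrix)]
    col_order = nearest_neighbour_order(cols)
    return [[matrix[r][c] for c in col_order] for r in row_order]


matrix = [[0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
order = nearest_neighbour_order(matrix)
print(order, path_cost(matrix, order))
print(reorder_rows_and_cols(matrix))
```

The result depends on the starting row, so a cheap improvement is to run the heuristic once per possible start and keep the ordering with the lowest `path_cost`.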

Finally, for implementations you can easily google "traveling salesman excel/python" and find many tutorials and examples.
