
I've been reading lately about various hierarchical clustering algorithms such as single-linkage clustering and group average clustering. In general, these algorithms don't tend to scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.

It is also claimed that group-average clustering can be implemented in O(N^2 log N) time. This is what my question is about.

I simply do not see how this is possible.

Explanation after explanation, such as:

http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html

http://nlp.stanford.edu/IR-book/completelink.html#averagesection

https://en.wikipedia.org/wiki/UPGMA#Time_complexity

... are claiming that group average hierarchical clustering can be done in O(N^2 log N) time by using priority queues. But when I read the actual explanation or pseudo-code, it always appears to me to be nothing better than O(N^3).

Essentially, the algorithm is as follows:

For an input sequence of size N:

Create a distance matrix of NxN #(this is O(N^2) time)
For each row in the distance matrix:
   Create a priority queue (binary heap) of all distances in the row

Then:

For i in 0 to N-1:
  Find the min element among the tops of all N priority queues # O(N)
  Let k = the row index of the min element

  For each element e in the kth row:
    Merge the min element with its nearest neighbor
    Update the corresponding values in the distance matrix
    Update the corresponding value in priority_queue[e]

So it's that last step that, to me, would seem to make this an O(N^3) algorithm. There's no way to "update" an arbitrary value in the priority queue without scanning the queue in O(N) time, assuming the priority queue is a binary heap. (A binary heap gives you constant-time access to the min element and O(log N) insertion/deletion, but you can't find an element by value in better than O(N) time.) And since we'd scan the priority queue for each row element, for each row, we get O(N^3).
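To make that concrete, here is a small sketch of my own (not code from the linked sources) using Python's standard heapq module, which is a plain binary heap with no update operation. The `row` contents are made up for illustration:

```python
import heapq

# One row of the distance matrix as (distance, column) pairs in a binary heap.
row = [(4.0, 1), (2.0, 3), (7.0, 2)]
heapq.heapify(row)

# heapq has no "update-key" operation: to change column 2's distance we must
# first locate its entry with an O(N) linear scan ...
i = next(idx for idx, (d, col) in enumerate(row) if col == 2)
row[i] = (1.0, 2)
heapq.heapify(row)   # ... and then repair the heap, which is O(N) again

print(row[0])        # (1.0, 2) is now the minimum of this row
```

Doing such a scan once per row element, per merge step, is what looks like O(N^3) overall.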

The priority queue is sorted by a distance value - but the algorithm in question calls for deleting the element in the priority queue which corresponds to k, the row index in the distance matrix of the min element. Again, there's no way to find this element in the queue without an O(N) scan.

So, I assume I'm probably wrong, since everyone else is saying otherwise. Can someone explain how this algorithm is not O(N^3), but in fact O(N^2 log N)?

Siler

3 Answers


I think you are saying that the problem is that in order to update an entry in a heap you have to find it, and finding it takes O(N) time. What you can do to get round this is to maintain an index that gives, for each item i, its location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant you then need to modify two entries of heapPos to keep the index correct, but this is just a constant factor on the work done in the heap.
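A minimal sketch of this idea, as my own illustration rather than code from any of the linked sources: a binary min-heap that maintains a position index (`pos`, the heapPos above) from item to heap slot, so updating any item's priority is an O(1) lookup followed by an O(log n) sift.

```python
class IndexedMinHeap:
    """Binary min-heap of (priority, item) pairs with an item -> slot index,
    so the priority of any item can be changed in O(log n)."""

    def __init__(self):
        self.heap = []   # list of (priority, item)
        self.pos = {}    # item -> index of its entry in self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        # Keep the index consistent: this is the constant-factor bookkeeping.
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] < self.heap[parent][0]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] < self.heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def push(self, item, priority):
        self.heap.append((priority, item))
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def peek_min(self):
        return self.heap[0]

    def update(self, item, new_priority):
        i = self.pos[item]            # O(1) lookup instead of an O(n) scan
        old_priority, _ = self.heap[i]
        self.heap[i] = (new_priority, item)
        if new_priority < old_priority:
            self._sift_up(i)          # O(log n) repair along one heap path
        else:
            self._sift_down(i)
```

With one such heap per row, each of the O(N) per-merge updates costs O(log N) instead of O(N), which is where the O(N^2 log N) total comes from.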

mcdowella

If you store the positions in the heap (which adds another O(n) memory) you can update the heap without scanning, touching only the changed positions. These updates are restricted to two paths in the heap (one for the removal, one for the update) and execute in O(log n). Alternatively, you could binary-search by the old priority, which is likely O(log n) too (but slower; the approach above finds the position in O(1)).

So IMHO you can indeed implement these in O(n^2 log n). But the implementation will still use a lot of memory, O(n^2), and anything needing O(n^2) memory does not scale. You usually run out of memory before you run out of time if you need O(n^2) memory...

Implementing these data structures is quite tricky, and when not done well, this may end up slower than a theoretically worse approach. Fibonacci heaps, for example, have nice properties on paper, but their constant costs are too high to pay off in practice.

Has QUIT--Anony-Mousse

No, because the distance matrix is symmetrical.

If the first entry in row 0 is to column 5, with a distance of 1, and that is the lowest in the system, then the first entry in row 5 must be the complementary entry, to column 0, also with a distance of 1.

In fact you only need a half matrix.

Malcolm McLean
  • You do realize that 0.5 * n^2 is still in O(n^2)? **Saving half of the matrix does not reduce asymptotic complexity.** And you misuse "reciprocal": the way you use it, you are saying d(x,y) = 1 / d(y,x), but distances are symmetric, not reciprocal. – Has QUIT--Anony-Mousse Aug 30 '16 at 06:51
  • It means that finding the complementary (better word) priority queue entry is O(1). The global minimum is represented twice, and both copies must be the first entries in their priority queues. – Malcolm McLean Aug 30 '16 at 06:56
  • The approach above uses (for a good reason) one priority queue per row, because otherwise you need to discard O(n) entries every time. – Has QUIT--Anony-Mousse Aug 30 '16 at 06:59