I have implemented an algorithm for hierarchical clustering and a simple method for drawing the dendrogram in C#. Now I want to add a dendrogram cutoff method and another one for coloring dendrogram branches. What would be an efficient algorithm to do that?
The cutoff method should return a list of dendrogram nodes, each of which is the root of a subtree that represents a single cluster. My data structure is a simple binary tree represented by a root node.
The node structure is as follows:
class DendrogramNode
{
    public String Id { get; set; }
    public DendrogramNode LeftNode { get; set; }
    public DendrogramNode RightNode { get; set; }
    public Double Height { get; set; }
}
The CutOff method should have the following signature:
List<DendrogramNode> CutOff(int numberOfClusters)
What I did so far:
My first attempt was to create a list of all DendrogramNodes, sort them by height in descending order, and take the first numberOfClusters entries from the sorted list. This fails because we may end up with a list containing parent nodes whose children are all in the list as well. In that situation the parent nodes should be removed.
My second attempt was to create a list of all linkages and store them in linkage order. This way I could take the last numberOfClusters linkages and use them to create the cutoff list. This works fine, but I don't like storing this information, as it is hard to maintain (especially for iterative clustering).
It seems like a simple problem, but somehow I am stuck on it. Can you help me find an efficient solution?
I guess solution 1 was OK up to a point, but it needs a step that removes parent nodes when all their children are also on the list, and it should be somehow iterative/recursive, as removing a node creates space to add another.
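To make the "remove a parent, add its children" idea concrete, here is a minimal sketch of how it might look. It is not a definitive implementation and rests on a few assumptions that are not stated above: the node properties are publicly accessible, leaves have null children, and heights are monotone (a parent is never lower than its children, which holds for single, complete and average linkage but not necessarily for centroid linkage). The static class DendrogramCut, the extra root parameter and the ColorIndices helper are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;

public class DendrogramNode
{
    public string Id { get; set; }
    public DendrogramNode LeftNode { get; set; }
    public DendrogramNode RightNode { get; set; }
    public double Height { get; set; }
}

// Hypothetical helper class; the root is passed in explicitly here.
public static class DendrogramCut
{
    // Start with the root as the only cluster, then repeatedly replace the
    // highest internal node by its two children until enough clusters exist.
    public static List<DendrogramNode> CutOff(DendrogramNode root, int numberOfClusters)
    {
        if (root == null) throw new ArgumentNullException(nameof(root));
        if (numberOfClusters < 1) throw new ArgumentOutOfRangeException(nameof(numberOfClusters));

        var clusters = new List<DendrogramNode> { root };
        while (clusters.Count < numberOfClusters)
        {
            // Find the internal node with the greatest height; leaves cannot be split.
            DendrogramNode top = null;
            foreach (var node in clusters)
            {
                bool isLeaf = node.LeftNode == null || node.RightNode == null;
                if (!isLeaf && (top == null || node.Height > top.Height)) top = node;
            }
            if (top == null) break; // only leaves remain: fewer leaves than requested clusters

            clusters.Remove(top);
            clusters.Add(top.LeftNode);
            clusters.Add(top.RightNode);
        }
        return clusters;
    }

    // Maps every node below each cluster root to that cluster's index,
    // which a drawing routine could translate into a branch color.
    public static Dictionary<DendrogramNode, int> ColorIndices(List<DendrogramNode> clusterRoots)
    {
        var colorOf = new Dictionary<DendrogramNode, int>();
        for (int i = 0; i < clusterRoots.Count; i++)
        {
            var stack = new Stack<DendrogramNode>();
            stack.Push(clusterRoots[i]);
            while (stack.Count > 0)
            {
                var node = stack.Pop();
                if (node == null) continue;
                colorOf[node] = i;
                stack.Push(node.LeftNode);
                stack.Push(node.RightNode);
            }
        }
        return colorOf;
    }
}
```

Because a parent is only ever replaced by its own children, the result can never contain a node together with one of its descendants, which is the problem the sorted-list attempt ran into. The linear scan makes this roughly quadratic in numberOfClusters; on .NET 6 or later, a PriorityQueue keyed on negative height would probably bring that down to O(k log k) if it matters. Nodes above the cut do not appear in the ColorIndices result, so they could be drawn in a neutral color.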