Problem: It appears to me that a fundamental property of a clustering method c() is whether the results c(A) and c(B) can be combined by some function f() of two clusterings, so that we do not have to apply the full clustering c(A+B) again but instead compute f(c(A),c(B)) and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c() to have this property is that it is deterministic, that is, the order of its internal processing is irrelevant to the result. However, this might not be sufficient.
It would be really nice to have some reference where I can look up which clustering methods support this and what a good f() looks like in the respective case.
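To make the property concrete, here is a minimal sketch with a deliberately trivial clustering. Both `c` (grouping 1-D points by the grid cell they fall into) and `f` (taking the union of clusters with the same cell key) are assumptions made up for illustration; they are not DBSCAN and not taken from any library. They only show what c(A+B) == f(c(A),c(B)) looks like when it does hold.

```python
from collections import defaultdict

def c(points, cell=1.0):
    """Toy clustering (hypothetical): group 1-D points by grid cell."""
    clusters = defaultdict(set)
    for p in points:
        clusters[int(p // cell)].add(p)
    return dict(clusters)

def f(ca, cb):
    """Merge two clusterings: union the clusters that share a cell key."""
    merged = defaultdict(set)
    for clustering in (ca, cb):
        for key, members in clustering.items():
            merged[key] |= members
    return dict(merged)

A = [0.1, 0.7, 2.3]
B = [0.4, 2.9, 5.0]

# Clustering the combined data equals merging the two partial clusterings.
assert c(A + B) == f(c(A), c(B))
```

This works here because a point's cluster depends only on the point itself. In a density-based method such as DBSCAN, adding B can turn points of A into core points, so f() would presumably need more than the two cluster lists as input.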
Example: At the moment I am thinking about DBSCAN, which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
- One point is reachable from another point if it is in its eps-neighborhood
- A core point is a point with at least minPts points reachable from it
- An edge goes from every core point to all points reachable from it
- Every point with an incoming edge from a core point is in the same cluster as the latter
If you miss the noise points, then assume that each core point reaches itself (reflexivity), and afterwards we define noise points to be clusters of size one. Border points are non-core points. Afterwards, if we want a partitioning, we can randomly assign each border point that is in multiple clusters to one of them. I do not consider this relevant for the method itself.
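The variant described above could be sketched roughly as follows. This is a naive O(n²) illustration, not a reference implementation: the function name `dbscan_multi` is hypothetical, points are assumed to be tuples of numbers, and I assume that the minPts count includes the point itself. Clusters are returned as a set of frozensets, so the result carries no ordering at all.

```python
import random
from math import dist

def dbscan_multi(points, eps, min_pts):
    """Return a set of clusters (frozensets of points). Border points may
    appear in several clusters; noise points form singleton clusters."""
    pts = list(set(points))
    # eps-neighborhood of every point; each point reaches itself (assumption).
    reach = {p: {q for q in pts if dist(p, q) <= eps} for p in pts}
    core = {p for p in pts if len(reach[p]) >= min_pts}

    clusters, seen = set(), set()
    for start in core:
        if start in seen:
            continue
        # Collect all core points connected through core-to-core edges.
        component, stack = {start}, [start]
        while stack:
            p = stack.pop()
            for q in reach[p] & core:
                if q not in component:
                    component.add(q)
                    stack.append(q)
        seen |= component
        # Attach everything reachable from these core points; border
        # points are included but never expanded, so they connect nothing.
        clusters.add(frozenset(set().union(*(reach[p] for p in component))))

    # Points not reached by any core point become singleton noise clusters.
    covered = set().union(*clusters)
    clusters |= {frozenset([p]) for p in pts if p not in covered}
    return clusters

# (2,) is a border point of two clusters, (10,) is noise.
pts = [(0,), (0.5,), (1,), (2,), (3,), (3.5,), (4,), (10,)]
result = dbscan_multi(pts, eps=1, min_pts=4)

shuffled = pts[:]
random.shuffle(shuffled)
# The input order does not change the result.
assert dbscan_multi(shuffled, eps=1, min_pts=4) == result
```

With these inputs only (1,) and (3,) are core points, so the border point (2,) ends up in both of their clusters without joining them.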