
Problem: It appears to me that a fundamental property of a clustering method c() is whether the results c(A) and c(B) can be combined by some function f() of two clusterings, such that we do not have to run the full clustering c(A+B) again but can instead compute f(c(A),c(B)) and still end up with the same result:

c(A+B) == f(c(A),c(B))
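In code, the property amounts to the following contract (a minimal Python sketch; `cluster`, `merge`, and list concatenation for A+B are placeholder assumptions, not an existing API):

    def is_mergeable_on(cluster, merge, A, B):
        """Check c(A+B) == f(c(A), c(B)) for one concrete pair A, B,
        where `cluster` stands for c() and `merge` for f()."""
        return cluster(A + B) == merge(cluster(A), cluster(B))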

I suppose that a necessary condition for some c() to have this property is that it is deterministic, i.e., the order of its internal processing is irrelevant to the result. However, this is probably not sufficient.

It would be really nice to have a reference where one can look up which clustering methods support this and what a good f() looks like in each case.


Example: At the moment I am thinking about DBSCAN, which should be deterministic if border points are allowed to belong to multiple clusters at the same time (without connecting those clusters):

  1. One point is reachable from another point if it is in its eps-neighborhood.
  2. A core point is a point with at least minPts reachable points.
  3. An edge goes from every core point to every point reachable from it.
  4. Every point with an incoming edge from a core point is in the same cluster as that core point.

If noise points seem to be missing from this definition, assume that each core point reaches itself (reflexivity); noise points are then defined as clusters of size one. Border points are the non-core points. Afterwards, if we want a partitioning, we can randomly assign each border point that lies in multiple clusters to one of them; I do not consider this step relevant for the method itself.
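A minimal Python sketch of this variant, assuming Euclidean distances and a brute-force O(n²) neighborhood computation (the names, such as deterministic_dbscan, are my own placeholders):

    import numpy as np
    from collections import defaultdict

    def deterministic_dbscan(X, eps, min_pts):
        """Order-independent DBSCAN variant: border points may belong to
        several clusters at once and do not connect those clusters."""
        n = len(X)
        # Steps 1+2: eps-neighborhoods; each point reaches itself
        # (reflexivity), a core point has at least min_pts reachable points.
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
        core = np.array([len(nb) >= min_pts for nb in neighbors])

        # Steps 3+4 for core points: two cores share a cluster iff they
        # are connected through a chain of mutually reachable cores.
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            if core[i]:
                for j in neighbors[i]:
                    if core[j]:
                        parent[find(i)] = find(j)

        # Every point collects the clusters of all core points reaching
        # it, so a border point may carry several cluster ids.
        labels = defaultdict(set)
        for i in range(n):
            if core[i]:
                for j in neighbors[i]:
                    labels[j].add(find(i))
        # Points reached by no core point become singleton noise clusters.
        for i in range(n):
            if not labels[i]:
                labels[i].add(-(i + 1))  # unique negative id per noise point
        return labels, core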

– Radio Controlled

1 Answer


Probably the only clustering where this is efficiently possible is single-linkage hierarchical clustering, because edges discarded within A×A and B×B are not necessary for finding the MST of the joined set.
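A minimal sketch of such an f() for single linkage (Python; the function names are my own placeholders). By the cycle property, any edge discarded when building the MST of A or of B would also be discarded in the MST of A+B, so the merge only needs the two part-MSTs plus the cross edges:

    import numpy as np
    from itertools import combinations

    def kruskal(n, edges):
        """Minimum spanning forest via Kruskal; edges are (weight, i, j)."""
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        mst = []
        for w, i, j in sorted(edges):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                mst.append((w, i, j))
        return mst

    def all_edges(X):
        """All pairwise edges within one data set (used for c(A), c(B))."""
        return [(float(np.linalg.norm(X[i] - X[j])), i, j)
                for i, j in combinations(range(len(X)), 2)]

    def merge_single_linkage(A, B, mst_A, mst_B):
        """f(): joint MST from the two part-MSTs plus cross edges only."""
        nA = len(A)
        shifted_B = [(w, i + nA, j + nA) for w, i, j in mst_B]
        cross = [(float(np.linalg.norm(a - b)), i, nA + j)
                 for i, a in enumerate(A) for j, b in enumerate(B)]
        return kruskal(nA + len(B), mst_A + shifted_B + cross)

With mst_A = kruskal(len(A), all_edges(A)) and likewise for B, the merged result equals the MST of the joined set, from which the single-linkage dendrogram can be read off.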

For DBSCAN specifically, you have the problem that the core point property can change when you add data. So c(A+B) likely has core points that were core in neither A nor B. This can cause clusters to merge. f() essentially needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit that core points of a subset must be core points of the entire set, you still need to find neighbors and the missing core points.
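A toy demonstration of this failure mode, reusing the deterministic_dbscan sketch from the question (the split into A and B is deliberately contrived):

    import numpy as np

    # eps = 1.0, minPts = 3; each point reaches itself (reflexivity).
    A = np.array([[0.0], [0.9]])  # p = [0.0] reaches only {p, [0.9]} in A
    B = np.array([[-0.9]])        # [-0.9] reaches only itself in B
    # No point is core in A or in B, so c(A) and c(B) contain only noise.
    labels, core = deterministic_dbscan(np.vstack([A, B]), eps=1.0, min_pts=3)
    print(core)  # [ True False False]: p is core in A+B, and
                 # {p, [0.9], [-0.9]} now forms a single cluster.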

– Has QUIT--Anony-Mousse
  • Yes, I think I had similar thoughts; in particular, one can easily forget that the core point property can change. I suppose that if one creates the partitions for distributed computation, one will do it such that this cannot happen (I haven't read the literature, though). In the cases I have studied, the similarity computation and the eps-neighborhood search are the most expensive part (I think even in complexity terms), so I am considering updating the similarity matrix and adding the transitive closure of reachability, as computed so far, to the 'edge' matrix (it converges faster), which however is not memory efficient... – Radio Controlled Feb 16 '19 at 08:50
  • There are parallel versions of DBSCAN based on merging results. But they don't formally use c(A) and c(B); they are asymmetric. – Has QUIT--Anony-Mousse Feb 16 '19 at 17:25
  • The question I have to ask of each of these parallel versions is: given two samples x, y where x = y, can it happen that x is in A and y is in B? I suppose that if the partitioning is intentional this usually will not happen, but I'd have to check the different approaches. – Radio Controlled Feb 16 '19 at 17:37
  • That question usually does not apply to these methods, because they don't partition the data that trivially. The obvious thing to do here is to use disjoint partitions a, b for the queries, but the *entire* data for the query results. Then the core property is guaranteed to be correct, and you compute it exactly once. You then *can* assemble the correct DBSCAN result. But it is f(cmodified(a, everything), cmodified(b, everything)), not f(c(a), c(b)). – Has QUIT--Anony-Mousse Feb 16 '19 at 21:26