What clustering algorithm is suitable for 2d rectangles without knowing the number of clusters ahead of time?

Question

The problem I have is that there are rectangles within rectangles. Think of a map, except with the following traits with the key point being: rectangles with similar density often share similar dimensions and similar position on the x axis with other rectangles, but sometimes the distance between these rectangles may be big but usually small. If the position on x axis or dimensions are clearly way off, they would not be similar.

rectangles do not intersect, smaller rectangles are completely inside a larger rectangle.
rectangles often have similar x position and similar dimensions (similar height and width), and have smaller rectangles inside it. The rectangle itself would be considered a cluster of it's own.
Sometimes the distance of these cluster from another cluster may be quite big (think of islands). Often these clusters share the same or similar dimension and same or similar density of sub rectangles. if so, they should be considered as part of the same cluster despite a distance between the two clusters.
The more dense a rectangle is (smaller rectangles inside), the more likely there is a similar or same dense rectangle sharing same or similar dimension nearby.

I've attached a diagram to describe the situation more clearly:

enter image description here

Red border means those groups are outliers, not part of any cluster and are ignored.
Blue border has many clusters (black borders containing black solid rectangles). They form a group of clusters that are similar due to the criteria mentioned above (similar width, similar X position, similar density). Even the clusters towards the bottom right corner is still considered part of this group because of the criteria (similar width, similar X position, similar density).
Turquois border has many clusters (black borders containing black solid rectangles). However, these clusters differ in dimension, x position, and density, from the ones in Blue border. They are considered a group of their own.

So far I found density clustering such as DBSCAN which seems to be perfect since it takes noise (outliers) into consideration, and you do not need to know ahead of time how many clusters there will be.

However, you need to define the minimum number of points needed to form a cluster and a threshold distance. What happens if you don't know these two and it can vary based on the problem described above?

Another seemingly plausible solution would be hierarchial (agglomerative) clustering (r-tree), but I'm concerned that I would still need to know the cutoff point in the tree depth level to determine if it's a cluster.

score 4 · Answer 1 · answered Dec 27 '13 at 10:13

You certainly will need to take all your constraints into account.

In general, your task looks more like constraint satisfaction to me than clustering.

Maybe some constraint clustering approaches are useful to you, but I'm not sure if they allow your kind of constraints. Usually, they only support must-link and must-not-link constraints.

But of course you should try DBSCAN (in particular also: generalized DBSCAN, since the generalization might allow you to add the constraints you have!) and R-trees (which aren't actually a clustering algorithm, but a data index).

Note that R-trees will put the "outliers" into some leaf, to ensure minimum fill.

As is, I cannot give you more detailed recommendations, because even from above sketch, your constraints are not well defined IMHO. Try putting them into pseudocode. You probably only have a small number of rectangles (say, 100); so you can afford to run really expensive algorithms, such as linkage clustering with a customized linkage criterion. Putting your criterions into code may already be 99% of the effort!

essentially I am looking to cluster similar dimensioned rectangles and with similar densities. Does this clarify the constraints more clearly? What if I use hierarchial clustering and determine a cluster by similar density values (k-means?) — javastudent, Dec 28 '13 at 00:09
You can't plug in custom similarities into k-means. You really need to figure out what "similar dimensioned" means to *you*. Because there is no universal definition of that. — Has QUIT--Anony-Mousse, Dec 28 '13 at 00:26
similar dimensioned would be +/-20 in width and height. I just tried DBSCAN and the epsilon kills it for me everytime. I need to do a long trial and error. OPTICS sounds good (as epsilon is not required) but I can't find any implementation in javascript or python — javastudent, Dec 28 '13 at 00:45

What clustering algorithm is suitable for 2d rectangles without knowing the number of clusters ahead of time?

1 Answers1