The problem I have is that there are rectangles within rectangles. Think of a map, except with the following traits with the key point being: rectangles with similar density often share similar dimensions and similar position on the x axis with other rectangles, but sometimes the distance between these rectangles may be big but usually small. If the position on x axis or dimensions are clearly way off, they would not be similar.
rectangles do not intersect, smaller rectangles are completely inside a larger rectangle.
rectangles often have similar x position and similar dimensions (similar height and width), and have smaller rectangles inside it. The rectangle itself would be considered a cluster of it's own.
Sometimes the distance of these cluster from another cluster may be quite big (think of islands). Often these clusters share the same or similar dimension and same or similar density of sub rectangles. if so, they should be considered as part of the same cluster despite a distance between the two clusters.
- The more dense a rectangle is (smaller rectangles inside), the more likely there is a similar or same dense rectangle sharing same or similar dimension nearby.
I've attached a diagram to describe the situation more clearly:
Red border means those groups are outliers, not part of any cluster and are ignored.
Blue border has many clusters (black borders containing black solid rectangles). They form a group of clusters that are similar due to the criteria mentioned above (similar width, similar X position, similar density). Even the clusters towards the bottom right corner is still considered part of this group because of the criteria (similar width, similar X position, similar density).
Turquois border has many clusters (black borders containing black solid rectangles). However, these clusters differ in dimension, x position, and density, from the ones in Blue border. They are considered a group of their own.
So far I found density clustering such as DBSCAN which seems to be perfect since it takes noise (outliers) into consideration, and you do not need to know ahead of time how many clusters there will be.
However, you need to define the minimum number of points needed to form a cluster and a threshold distance. What happens if you don't know these two and it can vary based on the problem described above?
Another seemingly plausible solution would be hierarchial (agglomerative) clustering (r-tree), but I'm concerned that I would still need to know the cutoff point in the tree depth level to determine if it's a cluster.