
I have a rectangular planar grid in which each cell is assigned an integer weight. I am looking for an algorithm to identify clusters of 3 to 6 adjacent cells with higher-than-average weight. These blobs should have an approximately circular shape.

In my case the average weight of the cells not containing a cluster is around 6, while that of cells containing a cluster is around 6+4, i.e. there is a "background weight" somewhere around 6. The weights fluctuate with Poisson statistics.

For a small background, greedy or seeded algorithms perform pretty well, but this breaks down when the weights of my cluster cells are close to the fluctuations of the background, i.e. the algorithms tend to find a cluster even where there is none. Also, I cannot do a brute-force search looping through all possible configurations, because my grid is large (something like 1000x1000) and I plan to do this very often (10^9 times). I have the impression there might exist ways to tackle this with graph theory. I have heard of vertex covers and cliques, but am not sure how best to translate my problem into their language. I know that graph theory might have issues with the statistical nature of the input, but I would be interested to see what algorithms from there could find, even if they cannot identify every cluster.

Here is an example clipping: the framed region has on average 10 entries per cell, all other cells have on average 6. Of course the grid extends further.

| 8|  8|  2|  8|  2|  3| 
| 6|  4|  3|  6|  4|  4| 
        ===========
| 8|  3||13|  7| 11|| 7|
|10|  4||10| 12|  3|| 2|
| 5|  6||11|  6|  8||12|
        ===========
| 9|  4|  0|  2|  8|  7|
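
To put a number on the difficulty (a back-of-the-envelope estimate, assuming pure Poisson fluctuations): a single cell at 10 is well within the fluctuations of a background of 6, but summed over the framed 3x3 region the excess becomes significant.

    # Significance of the framed 3x3 region above, assuming a flat
    # Poisson background with mean 6 per cell.
    import math

    framed = [13, 7, 11, 10, 12, 3, 11, 6, 8]
    background = 6.0

    total = sum(framed)                  # 81
    expected = background * len(framed)  # 54
    sigma = math.sqrt(expected)          # ~7.3, std dev of the Poisson sum
    print((total - expected) / sigma)    # ~3.7 sigma excess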
Benjamin Bannier
  • Sounds like a long-ago-solved problem in computer vision, for instance trying to detect dim stars on a dark background. While astronomers might have lots of resources, they also would not mind doing it fast. I suspect that your problem has been solved before. By the way, I would stick to a matrix data structure; it is faster to work with and it makes sense. What are you trying to do? – Hamish Grubijan Apr 17 '10 at 14:49
  • 1000x1000 is not that much. You could: A) use a good, fast "matrix" library such as SciPy; B) compute the average "brightness"; C) iterate over every cell left to right, then top to bottom, assuming a 3x3 square shape, and compute the window's average density. Looks about average? Keep going. Looks higher than normal? Then investigate the region (see the sketch after these comments). Problems can easily occur if you have large regions (think clouds in the sky next to dim stars); these clouds can mess everything up. One must make assumptions about what the data looks like. – Hamish Grubijan Apr 17 '10 at 14:58
  • @Hamish: And as for "solved": I am looking for a reliable way (few false positives). AFAIK this could be an NP-hard problem. – Benjamin Bannier Apr 17 '10 at 14:59
  • I plan to do this a billion times. – Benjamin Bannier Apr 17 '10 at 15:01
  • @honk, this is not even a well-defined problem. Your spec is weak. Particularly: what is "higher than average"? 1 std (whatever that might mean) away? Just about anything in theory is an NP-complete problem. Practitioners like those in the field of computer vision just accept the reality and deal with it the best way they can. You plan to do what 1 billion times? You need to process 1B "images"? Give us some background on what you are trying to do. Process real-time data? By the way, wavelets might be useful, but I do not really understand them. – Hamish Grubijan Apr 17 '10 at 15:08
  • I am looking for a different way to do tower clustering in a calorimeter, see e.g. http://scholar.google.com/scholar?q=calorimeter+clustering&hl=en&btnG=Search&as_sdt=20000000001&as_sdtp=on – Benjamin Bannier Apr 17 '10 at 15:12
  • Well, I believe that graph theory is the wrong direction. Converting this matrix into a graph will hardly buy you anything: the vertices will have the same weights as the cells, but the weights of the edges are meaningless. Also, locating a neighbor is easy enough with a plain matrix - just look directly south (S), north (N), E, W, SE, SW, NE, NW. I believe that this problem is statistical in nature (like it or not), for instance because the instruments that record the data are imperfect; there is always noise. Deterministic algorithms do not apply here. Set cover and clique are too theoretical for this. – Hamish Grubijan Apr 17 '10 at 15:23
  • This definitely is a bounty candidate (after 2 days). – Hamish Grubijan Apr 17 '10 at 17:05
  • @Hamish You read too much into my question: I do not expect a solution from anybody; I am asking for a translation from "grid with weights" to "nodes, edges, and weights". – Benjamin Bannier Apr 17 '10 at 20:28
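
A minimal sketch of the window scan Hamish describes above, assuming NumPy and a known flat background; the 3x3 window and the 3-sigma threshold are illustrative choices, not recommendations:

    # Slide a 3x3 window over the grid and flag windows whose summed
    # weight sits well above the Poisson expectation of the background.
    import numpy as np

    def flag_windows(grid, background=6.0, k=3, n_sigma=3.0):
        """Return top-left corners of k x k windows with a significant excess."""
        h, w = grid.shape
        hits = []
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                total = grid[i:i + k, j:j + k].sum()
                expected = background * k * k
                sigma = np.sqrt(expected)  # Poisson: variance of the sum = its mean
                if total > expected + n_sigma * sigma:
                    hits.append((i, j))
        return hits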

4 Answers


For graph theory solutions there are a couple of sentences on Wikipedia, but you are probably best off posting on MathOverflow. This question might also be useful.

The traditional (and, given its ubiquity, probably best) computing method for solving these is raster analysis, well known in the world of GIS and remote sensing, so there are a number of tools that provide implementations. Keywords to use to find the one best suited to your problem would be raster, nearest neighbor, resampling, and clustering. The GDAL library is often the basis for other tools.

E.g. http://clusterville.org/spatialtools/index.html

You could try checking out the GDAL library and source code to see if you can use it in your situation, or to see how it is implemented.

For checking for circular shapes you could convert the remaining values to polygons and check the resultant features:

http://www.gdal.org/gdal_polygonize.html
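
For illustration, a minimal sketch of driving the polygonize step from Python via the osgeo bindings; the file names and the 'DN' field are placeholders, and the input raster is assumed to have been thresholded so that background cells share one value:

    # Turn connected runs of equal-valued pixels into polygon features.
    from osgeo import gdal, ogr

    src = gdal.Open("thresholded.tif")  # hypothetical pre-thresholded raster
    band = src.GetRasterBand(1)

    drv = ogr.GetDriverByName("ESRI Shapefile")
    dst = drv.CreateDataSource("blobs.shp")
    layer = dst.CreateLayer("blobs", geom_type=ogr.wkbPolygon)
    layer.CreateField(ogr.FieldDefn("DN", ogr.OFTInteger))

    # Each connected region becomes one polygon; its pixel value goes
    # into the 'DN' attribute field (field index 0).
    gdal.Polygonize(band, None, layer, 0, [], callback=None)
    dst = None  # close the datasource and flush to disk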

geographika
  • Thanks for the links geographika. Since my signal is very weak I wouldn't like to discard the background though (this opens up another set of problems with fluctuations anyway). And for shape, I don't really care about it, I just want to find blobs. In the end these ideas would still (need to) go along the lines of greedy or seeded algorithms which are known to have issues here. That's why I am looking for different approaches. – Benjamin Bannier Apr 17 '10 at 18:32
  • Thanks again for the pointers. But as I wrote, I am already aware of these tools and am looking for a totally different direction from which to attack this (mostly so as not to get too used to the limitations of thinking in their terms, I guess). I guess the question is just poorly phrased or lacks focus. – Benjamin Bannier Apr 21 '10 at 05:51

I'm not sure I see a graph theory analogy, but you can speed things up by pre-computing an area integral. This feels like a multi-scale thing.

    A[i,j] = Sum[p[u,v], {u, 0, i}, {v, 0, j}];

Then the average brightness of the rectangular region with corners (a,b) and (c,d) is

    (A[c,d] - (A[c,b] + A[a,d]) + A[a,b])/((c-a)(d-b))

Overflow is probably not your friend if you have big numbers in your cells, since the table entries grow with the size of the grid.
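
A sketch of this in NumPy; the zero padding is my own convention so that A[i,j] holds the sum of p[:i,:j] and the four-term formula above works without edge cases, and int64 guards against the overflow just mentioned:

    import numpy as np

    def area_table(p):
        """A[i, j] = sum of p[:i, :j]; zero-padded so all corner terms exist."""
        A = np.zeros((p.shape[0] + 1, p.shape[1] + 1), dtype=np.int64)
        A[1:, 1:] = p.cumsum(axis=0, dtype=np.int64).cumsum(axis=1)
        return A

    def region_mean(A, a, b, c, d):
        """Mean of p over rows a..c-1 and columns b..d-1, in four lookups."""
        total = A[c, d] - (A[c, b] + A[a, d]) + A[a, b]
        return total / ((c - a) * (d - b))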

Justin
  • Could you explain how this could be useful for identifying the blob? The problem is really that the "signal" is very close to the "background". – Benjamin Bannier May 13 '10 at 19:24
  • Instead of computing the average value in a k-neighborhood of a pixel (which requires k*k memory accesses and additions) you can use 4 memory accesses and 3 additions. – Justin May 13 '10 at 19:57
  • This type of thing might work well if you started with larger regions of the image and tried to recursively find areas of similar signal/noise ratio, then apply your area averaging filter. – Justin May 13 '10 at 20:12

Use the union-find algorithm for clustering? It's very fast.

I guess the graph would result from considering each pair of neighboring high-valued cells as connected. Use the union-find algorithm to find all clusters, and accept those above a certain size, perhaps with shape constraints too (e.g. based on the average squared distance from the cluster center vs. the cluster size). It is a trivial variation on the union-find algorithm to collect the statistics you would need for this as you go (count, sum of x, sum of x^2, sum of y, sum of y^2).
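
A minimal sketch of this in Python; the cut of 9 and the 4-connectivity are assumptions, and path halving keeps find close to constant time:

    # Union-find over above-threshold cells, accumulating per-cluster
    # statistics (count, sum x, sum x^2, sum y, sum y^2) at each root.

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def cluster(grid, cut=9):
        h, w = len(grid), len(grid[0])
        cells = [(i, j) for i in range(h) for j in range(w) if grid[i][j] >= cut]
        parent = {c: c for c in cells}
        stats = {(i, j): [1, i, i * i, j, j * j] for (i, j) in cells}
        for (i, j) in cells:
            for nb in ((i + 1, j), (i, j + 1)):  # 4-connectivity, each pair once
                if nb in parent:
                    ra, rb = find(parent, (i, j)), find(parent, nb)
                    if ra != rb:
                        parent[rb] = ra
                        stats[ra] = [s + t for s, t in zip(stats[ra], stats[rb])]
        # roots own the accumulated statistics of their whole cluster
        return {r: stats[r] for r in parent if find(parent, r) == r}

From count, sum x and sum x^2 you get the spread of the cluster in x (and likewise in y), which is enough for a cheap size and roundness cut.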

Paul Harrison

If you are just looking for a way to translate your problem into a graph problem, here's what you could do.

From each point, look at all your neighbors (this could be the 8 adjacent squares or the 4 adjacent ones, depending on what you want). Find the neighbor with the maximum value; if it is larger than you, connect yourself to that square. If it is smaller, do nothing.

After this you will have a forest, or possibly a single tree (though I imagine that is unlikely). This is one possible translation of your matrix to a graph; I'm not sure it is the most useful one.
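
A small sketch of this translation in Python, assuming 8-connectivity (ties between equal-valued neighbors are broken arbitrarily by max):

    # Point every cell at its largest strictly-larger neighbor. Local
    # maxima get no parent, so they become the roots of the forest and
    # each tree collects the cells "draining" into one peak.

    def build_forest(grid):
        h, w = len(grid), len(grid[0])
        parent = {}
        for i in range(h):
            for j in range(w):
                nbrs = [(i + di, j + dj)
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di or dj) and 0 <= i + di < h and 0 <= j + dj < w]
                best = max(nbrs, key=lambda c: grid[c[0]][c[1]])
                if grid[best[0]][best[1]] > grid[i][j]:
                    parent[(i, j)] = best
        return parent  # cells missing from parent are the tree roots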

luke
  • That sounds like a very interesting approach. Would I create an extra tree to account for fluctuations? – Benjamin Bannier May 18 '10 at 19:33
  • What you described in your answer clusters cells to the neighbor with the highest count. This count can fluctuate, so the combination might not belong to the same cluster. – Benjamin Bannier May 18 '10 at 20:28