
For reproducibility, let me simplify my problem as follows: I have 100 spatial polygons, each representing the convex hull of N random samples drawn from a population (repeated 100 times) to assess the sensitivity of a model to individual values. How do I calculate the percentage overlap of these polygons? The ideal solution should be fast and introduce as little approximation as possible.

I have no particular reason to use the GIS capabilities of R, other than that I thought this would be the easiest approach to the problem.

library(sp)
library(raster)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1

set.seed(11)

dt <- data.frame(x = rnorm(1e3, 10, 3) + sample(-5:5, 1e3, replace = TRUE))
dt$y <- (rnorm(1e3, 3, 4) + sample(-10:10, 1e3, replace = TRUE)) + dt$x

dt <- rbind(dt, data.frame(x = -dt$x, y = dt$y))

plot(dt, asp = 1)

# Convex hull of the full dataset, closed by repeating the first vertex
dt.chull <- dt[chull(dt),]
dt.chull <- rbind(dt.chull, dt.chull[1,])

lines(dt.chull, col = "green")

uncert.polys <- lapply(1:100, function(i) {

  # Sample 100 points from the full point cloud
  tmp <- dt[sample(rownames(dt), 1e2),]

  # points(tmp, col = "red")

  # Convex hull of the sample, closed by repeating the first vertex
  tmp <- tmp[chull(tmp),]
  tmp <- rbind(tmp, tmp[1,])

  # lines(tmp, col = "red")

  # Return the hull as a SpatialPolygonsDataFrame with an id column
  tmp <- sp::SpatialPolygons(list(sp::Polygons(list(sp::Polygon(tmp)), ID = i)))
  sp::SpatialPolygonsDataFrame(tmp, data = data.frame(id = i, row.names = i))
})

polys <- do.call(rbind, uncert.polys)

plot(polys, add = TRUE, border = "red")

My initial attempt was to use the sf::st_intersection function:

sf.polys <- sf::st_make_valid(sf::st_as_sf(polys))
all(sf::st_is_valid(sf.polys))
#> [1] TRUE

sf::st_intersection(sf.polys)
#> Error in CPL_nary_intersection(x): Evaluation error: TopologyException: found non-noded intersection between LINESTRING (-9.80706 -0.619557, -7.66331 -3.55177) and LINESTRING (-9.80706 -0.619557, -9.80706 -0.619557) at -9.8070645468969637 -0.61955676978603658.

The error is likely related to polygon edges that are "almost coincident but not identical". Multiple solutions (1, 2) have been suggested for this GEOS-related problem, but I have not managed to make any of them work with my data:

sf.polys <- sf::st_set_precision(sf.polys, 1e6) 

sf.polys <- sf::st_snap(sf.polys, sf.polys, tolerance = 1e-4)

sf::st_intersection(sf.polys)
#> Error in CPL_nary_intersection(x): Evaluation error: TopologyException: found non-noded intersection between LINESTRING (-13.7114 32.7341, 3.29417 30.3736) and LINESTRING (3.29417 30.3736, 3.29417 30.3736) at 3.2941702528617176 30.373627946201278.
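
Another workaround that is sometimes suggested for non-noded intersection errors is a zero-width buffer, which rebuilds each geometry through GEOS. A minimal sketch (untested on this exact data; it may or may not resolve the issue):

# Untested workaround: zero-width buffer rebuilds each geometry through GEOS
sf.polys.buf <- sf::st_buffer(sf.polys, dist = 0)
sf::st_intersection(sf.polys.buf)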

So, I have to approximate the polygon overlap using rasterization:

# 100 x 100 grid covering the extent of the data
GT <- sp::GridTopology(cellcentre.offset = c(round(min(dt$x),1), round(min(dt$y),1)), 
                       cellsize = c(diff(round(range(dt$x), 1))/100, diff(round(range(dt$y), 1))/100),
                       cells.dim = c(100, 100)
)

SG <- sp::SpatialGrid(GT)

# For each polygon, flag the grid cells it covers (1) or misses (0)
tmp <- lapply(seq_along(uncert.polys), function(i) {
  
  out <- sp::over(SG, uncert.polys[[i]])
  out[!is.na(out)] <- 1
  out[is.na(out)] <- 0
  out
})

# Sum the flags across polygons; with 100 polygons the count is also the percentage
tmp <- data.frame(overlapping.n = Reduce("+", lapply(tmp, "[[", 1)))
tmp$overlapping.pr <- 100*tmp$overlapping.n/100

uncert.data <- SpatialGridDataFrame(SG, tmp)

## Plot

plot(x = range(dt$x),
     y = range(dt$y), 
     type = "n"
)

plot(raster::raster(uncert.data), col = colorRampPalette(c("white", "red", "blue", "white"))(100), add = TRUE)
plot(polys, add = TRUE, border = adjustcolor("black", alpha.f = 0.2), cex = 0.5)
points(dt, pch = ".", col = "black", cex = 3)
lines(dt.chull, col = "green")

This approach gives results, but the output is only approximate and takes a long time to compute. There has to be a better way of doing this.

For performance comparison purposes, here is my current solution:

gridOverlap <- function(dt, uncert.polys) {
  GT <- sp::GridTopology(cellcentre.offset = c(round(min(dt$x),1), round(min(dt$y),1)), 
                         cellsize = c(diff(round(range(dt$x), 1))/100, diff(round(range(dt$y), 1))/100),
                         cells.dim = c(100, 100)
  )
  
  SG <- sp::SpatialGrid(GT)
  
  tmp <- lapply(seq_along(uncert.polys), function(i) {
    
    out <- sp::over(SG, uncert.polys[[i]])
    out[!is.na(out)] <- 1
    out[is.na(out)] <- 0
    out
  })
  
  tmp <- data.frame(overlapping.n = Reduce("+", lapply(tmp, "[[", 1)))
  tmp$overlapping.pr <- 100*tmp$overlapping.n/100
  
  SpatialGridDataFrame(SG, tmp)
}

system.time(gridOverlap(dt = dt, uncert.polys = uncert.polys))
#   user  system elapsed 
#   3.011   0.083   3.105 

The performance matters for larger datasets (this solution takes several minutes in the real application).

Created on 2020-09-01 by the reprex package (v0.3.0)


2 Answers


Here is a solution to finding the interior without any errors using spatstat and the underlying polyclip package.

library(spatstat)

# Data from OP
set.seed(11)
dt <- data.frame(x = rnorm(1e3, 10, 3) + sample(-5:5, 1e3, replace = TRUE))
dt$y <- (rnorm(1e3, 3, 4) + sample(-10:10, 1e3, replace = TRUE)) + dt$x
dt <- rbind(dt, data.frame(x = -dt$x, y = dt$y))

# Converted to spatstat classes (`ppp` not strictly necessary -- just a habit)
X <- as.ppp(dt, W = owin(c(-25,25),c(-15,40)))
p1 <- owin(poly = dt[rev(chull(dt)),])

# Plot of data and convex hull
plot(X, main = "")
plot(p1, add = TRUE, border = "green")

# Convex hulls of sampled points in spatstat format
polys <- lapply(1:100, function(i) {
  tmp <- dt[sample(rownames(dt), 1e2),]
  owin(poly = tmp[rev(chull(tmp)),])
})

# Plot of convex hulls
for(i in seq_along(polys)){
  plot(polys[[i]], add = TRUE, border = "red")
}

# Intersection of all convex hulls plotted in transparent blue
interior <- do.call(intersect.owin, polys)
plot(interior, add = TRUE, col = rgb(0,0,1,0.1))

It is not clear to me what you want to do from here, but at least this approach avoids the errors of polygon clipping.
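
If a single robustness summary is useful, the area of the common interior relative to the full hull can be computed directly from the owin objects. A minimal sketch (my suggestion, assuming this ratio is a quantity of interest; area.owin is spatstat's window-area function):

# Fraction of the full convex hull shared by all 100 sampled hulls
area.owin(interior) / area.owin(p1)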

To do the grid based solution in spatstat I would convert the windows to binary image masks and then work from there:

# Rasterise the observation window and each hull on a common 200 x 200 grid
Wmask <- as.im(Window(X), dimyx = c(200, 200))
masks <- lapply(polys, as.im.owin, xy = Wmask, na.replace = 0)
# Pixelwise fraction of the 100 hulls covering each cell
maskmean <- Reduce("+", masks)/100
plot(maskmean)

The speed depends on the resolution you choose, but I would guess this is much faster than the current suggestion using sp/raster (which can probably be sped up a great deal using the same logic as here, so sticking with raster would be another option).
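
For reference, here is a minimal sketch of what that raster-based variant could look like (my untested adaptation of the same mask-and-sum logic, not benchmarked against the original code):

library(raster)

# Common 200 x 200 template raster over the data extent
template <- raster::raster(raster::extent(c(range(dt$x), range(dt$y))),
                           nrows = 200, ncols = 200)

# Rasterise each hull to 1 inside / 0 outside, then average across hulls
masks <- lapply(uncert.polys, function(p) {
  raster::rasterize(p, template, field = 1, background = 0)
})
overlap.pr <- 100 * Reduce("+", masks) / length(masks)
plot(overlap.pr)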

  • I was just planning on placing a bounty on this question but you were quicker ;) spatstat indeed seems better performance-wise. Thanks Ege! I need to calculate the percentage overlap of the polygons to illustrate the "robustness" of convex hull estimation. It's been a while since I used spatstat. Would you know how to do it? I.e. the output should be something like the `uncert.data` object in my example, but preferably vector instead of a grid. I managed to make the grid work in my application too, so if the performance issue is fixed, I'd be happy. – Mikko Sep 09 '20 at 08:31
  • I'll start the bounty anyway to motivate you to modify your answer ;) It might take a couple of rounds. – Mikko Sep 09 '20 at 08:32
  • I'm not sure how to even do this vector-based from an algorithmic point of view. What is the procedure? Is it enough to find all pairwise intersections and then construct your desired results from there? When I see your example I would definitely go for a grid-based solution myself. – Ege Rubak Sep 09 '20 at 08:44
  • Grid is fine as long as calculating it does not take as long as in my current solution. One could always pump up the grid-resolution if the performance was better. – Mikko Sep 09 '20 at 08:46
  • On my machine the above grid solution was a factor of 20 faster than your original code. I hope that helps. – Ege Rubak Sep 09 '20 at 19:31

Edit: Reworked a possibly better solution further below.

I have been thinking on this one for a bit, and my inclination is that something like a triangulation and dynamic programming approach could work well.

  1. Consider the points and lines of each convex hull, and label them with the hull they belong to (probably stored in a lookup table).
  2. Take the points from all lines and triangulate them; each triangle will then be annotated with how many convex hulls it lies within.
  3. At this point there are quite a few ways to determine how many convex hulls a triangle is in. The examples you showed lean towards some possible optimizations, but as a general solution the best route is probably to loop over each triangle and test which hulls it is in, O(T*H); see the sketch after this list.
  4. It should be possible to annotate the points/edges/triangles and work out which hulls each is within (especially which hulls the left and right side of each edge are in; that can then be used to determine which hulls contain each triangle, via the set union of the hulls whose interior lies on the inside side of the line, and from that get a count of the number of hulls the triangle is in). The tricky bit is how to cascade the information without taking O(T*H). Will think on it more and reply later.
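
A minimal sketch of the brute-force counting step from item 3 (my illustration in R; the triangulation itself is assumed to come from elsewhere, and since the hulls are convex, testing the triangle centroid suffices when the triangulation respects all hull edges):

library(sp)

# triangles: list of 3x2 vertex matrices; hulls: list of nx2 ring matrices.
# Returns, for each triangle, the number of hulls containing it. O(T*H).
countHulls <- function(triangles, hulls) {
  sapply(triangles, function(tri) {
    cx <- mean(tri[, 1])  # triangle centroid
    cy <- mean(tri[, 2])
    sum(sapply(hulls, function(h) {
      sp::point.in.polygon(cx, cy, h[, 1], h[, 2]) > 0
    }))
  })
}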

Edit: A better approach

To the comment question "Should their intersection be added to the list of points to be triangulated?": yes, adding the intersections to the point set reduces the ambiguity. Those intersections can be found with a sweep-line technique, in particular the Bentley–Ottmann algorithm, which detects all crossings in O((N + K) log N) time: https://en.wikipedia.org/wiki/Bentley%E2%80%93Ottmann_algorithm
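
For small inputs, a naive O(S^2) pairwise check is an easy stand-in for a full sweep line. A minimal sketch (my illustration; segments are given as c(x1, y1, x2, y2)):

# Returns the intersection point of two segments, or NULL if none
segIntersect <- function(a, b) {
  d <- c(a[3] - a[1], a[4] - a[2])    # direction of segment a
  e <- c(b[3] - b[1], b[4] - b[2])    # direction of segment b
  denom <- d[1] * e[2] - d[2] * e[1]  # 2D cross product; 0 if parallel
  if (abs(denom) < .Machine$double.eps) return(NULL)
  t <- ((b[1] - a[1]) * e[2] - (b[2] - a[2]) * e[1]) / denom
  u <- ((b[1] - a[1]) * d[2] - (b[2] - a[2]) * d[1]) / denom
  if (t < 0 || t > 1 || u < 0 || u > 1) return(NULL)  # crossing lies off a segment
  c(x = a[1] + t * d[1], y = a[2] + t * d[2])
}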

So here is an updated method that is a bit more straightforward. An example image is included below (it appears smaller than expected...).

[Image: 3 convex hulls example]

The image above shows 3 convex hulls, with each point numbered in the order a left-to-right sweep line would cross it. In fact, Andrew's algorithm for convex hulls avoids the need for an explicit sweep line, since one is effectively built into the algorithm. Basically, you use Andrew's algorithm to build all the hulls in one pass, but with duplicates.

So basic process looks like this:

  1. Set up empty lists for each known hull (G/R/B: Green, Red, Black), for both the upper and lower hulls; that is, a mapping from each point to the hulls it belongs to (initialized as empty lists).
  2. Sort all points (within the convex hulls) using the sort order of Andrew's algorithm (see the sketch after this list).
  3. Using that same sort order, add each point to each hull (upper and lower).
  4. We then use Andrew's algorithm to consider the points. The trick is that we already know what each hull will be. Consider the Red hull, with points 2, 7 and 8, and the other points 4 and 5 (5 is actually two points; I forgot a label). Point 4 would be added as a hull point, but since we are focusing on the Red hull, we just ignore it (since it is not inside the grey hull). The same applies if multiple hulls use the same point, since that point isn't technically inside any of those hulls (unless you want to count it as such, in which case all hull points are within at least 1 hull; this might be useful for the visual benefit, and I think it is the only way to make the intersection coloring practical). The two 5 points, however, are within the grey hull, so we note that they are both inside the Red hull. Overall performance is roughly O(N*C), where N is the number of points and C is the number of hulls. I imagine this can probably be dropped to something like O(C log N + N log C) with enough effort, but it may not be worth it.
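
For reference, here is a minimal sketch of the underlying monotone chain construction (Andrew's algorithm) for a single hull, as an R transcription of the standard method; the per-hull bookkeeping described above would be layered on top:

# Andrew's monotone chain convex hull for a 2-column matrix of points;
# returns the hull vertices in counter-clockwise order
monotoneChain <- function(pts) {
  pts <- pts[order(pts[, 1], pts[, 2]), , drop = FALSE]  # the sort order from step 2
  cross <- function(o, a, b)
    (a[1] - o[1]) * (b[2] - o[2]) - (a[2] - o[2]) * (b[1] - o[1])
  buildHalf <- function(idx) {
    half <- integer(0)
    for (i in idx) {
      # Pop points that would make a non-counter-clockwise turn
      while (length(half) >= 2 &&
             cross(pts[half[length(half) - 1], ],
                   pts[half[length(half)], ],
                   pts[i, ]) <= 0) {
        half <- half[-length(half)]
      }
      half <- c(half, i)
    }
    half
  }
  lower <- buildHalf(seq_len(nrow(pts)))       # lower hull, left to right
  upper <- buildHalf(rev(seq_len(nrow(pts))))  # upper hull, right to left
  # Concatenate, dropping the duplicated endpoints
  pts[c(lower[-length(lower)], upper[-length(upper)]), , drop = FALSE]
}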

You can run set intersections to find all the intersection points, then use them to build polygons for more exact coloring. However, this makes things a lot messier, and I'm still trying to work out a good solution for that. I suspect that counting a point as being "within its own hull" may help greatly with it, in which case you can probably just take the minimum over the points that make up the polygon. So if a polygon had points within 1/2/2/2 hulls, then that area is within 1 hull.

I would first test this in the situation where no point is in multiple hulls. Then adjust the logic to support multiple hulls.

For best performance, I would run this algorithm only on the actual hull points, then just overlay the results (color-coded polygons, if you went the line-segment route) on top of the actual dataset if you need to. If you didn't go the color-coded polygon route, I would probably color polygons based on the average number of hulls they are within, or perhaps run the algorithm on all points (not just hull points), but that would be a massive performance hit. Probably better to just do the work for line segments.

  • This sounds interesting, but I'm not sure I understand what to triangulate in the first place. We have a point cloud of 1000 points. Then we subsample the points a bunch of times and get a list of convex hulls. Each hull is effectively a small list of vertices and the lines connecting them. Lines from different hulls cross each other. Should their intersection be added to the list of points to be triangulated? How are the lines forced to be edges in the triangulation? Maybe a sketch using a small number of points and hulls would help me understand your strategy. – Ege Rubak Sep 15 '20 at 08:26
  • Hmmm I would say yes to triangulating the intersections. I think I over-complicated this perhaps. I'm going to post an update here soon enough. – Nuclearman Sep 16 '20 at 06:12
  • @EgeRubak Updated the answer to include a line-segment approach that is O(N * C) where N is the number of points and C is the number of convex hulls. You can likely only use the points within the actual convex hulls (so skipping the mass of points in the center). – Nuclearman Sep 16 '20 at 07:38