
I'm working with 2D geographical data. I have a long list of contour paths, and I want to determine, for every point in my domain, how many contours it lies inside (i.e. I want to compute the spatial frequency distribution of the features represented by the contours).

To illustrate what I want to do, here's a first very naive implementation:

import numpy as np
from shapely.geometry import Polygon, Point

def comp_frequency(paths,lonlat):
    """
    - paths: list of contour paths, made up of (lon,lat) tuples
    - lonlat: array containing the lon/lat coordinates; shape (nx,ny,2)
    """
    frequency = np.zeros(lonlat.shape[:2])
    contours = [Polygon(path) for path in paths]

    # Very naive and accordingly slow implementation
    for (i,j),v in np.ndenumerate(frequency):
        pt = Point(lonlat[i,j,:])
        for contour in contours:
            if contour.contains(pt):
                frequency[i,j] += 1

    return frequency

lon = np.array([
    [-1.10e+1,-7.82e+0,-4.52e+0,-1.18e+0, 2.19e+0,5.59e+0,9.01e+0,1.24e+1,1.58e+1,1.92e+1,2.26e+1],
    [-1.20e+1,-8.65e+0,-5.21e+0,-1.71e+0, 1.81e+0,5.38e+0,8.97e+0,1.25e+1,1.61e+1,1.96e+1,2.32e+1],
    [-1.30e+1,-9.53e+0,-5.94e+0,-2.29e+0, 1.41e+0,5.15e+0,8.91e+0,1.26e+1,1.64e+1,2.01e+1,2.38e+1],
    [-1.41e+1,-1.04e+1,-6.74e+0,-2.91e+0, 9.76e-1,4.90e+0,8.86e+0,1.28e+1,1.67e+1,2.06e+1,2.45e+1],
    [-1.53e+1,-1.15e+1,-7.60e+0,-3.58e+0, 4.98e-1,4.63e+0,8.80e+0,1.29e+1,1.71e+1,2.12e+1,2.53e+1],
    [-1.66e+1,-1.26e+1,-8.55e+0,-4.33e+0,-3.00e-2,4.33e+0,8.73e+0,1.31e+1,1.75e+1,2.18e+1,2.61e+1],
    [-1.81e+1,-1.39e+1,-9.60e+0,-5.16e+0,-6.20e-1,3.99e+0,8.66e+0,1.33e+1,1.79e+1,2.25e+1,2.70e+1],
    [-1.97e+1,-1.53e+1,-1.07e+1,-6.10e+0,-1.28e+0,3.61e+0,8.57e+0,1.35e+1,1.84e+1,2.33e+1,2.81e+1],
    [-2.14e+1,-1.69e+1,-1.21e+1,-7.16e+0,-2.05e+0,3.17e+0,8.47e+0,1.37e+1,1.90e+1,2.42e+1,2.93e+1],
    [-2.35e+1,-1.87e+1,-1.36e+1,-8.40e+0,-2.94e+0,2.66e+0,8.36e+0,1.40e+1,1.97e+1,2.52e+1,3.06e+1],
    [-2.58e+1,-2.08e+1,-1.54e+1,-9.86e+0,-3.99e+0,2.05e+0,8.22e+0,1.44e+1,2.05e+1,2.65e+1,3.22e+1]])

lat = np.array([
    [ 29.6, 30.3, 30.9, 31.4, 31.7, 32.0, 32.1, 32.1, 31.9, 31.6, 31.2],
    [ 32.4, 33.2, 33.8, 34.4, 34.7, 35.0, 35.1, 35.1, 34.9, 34.6, 34.2],
    [ 35.3, 36.1, 36.8, 37.3, 37.7, 38.0, 38.1, 38.1, 37.9, 37.6, 37.1],
    [ 38.2, 39.0, 39.7, 40.3, 40.7, 41.0, 41.1, 41.1, 40.9, 40.5, 40.1],
    [ 41.0, 41.9, 42.6, 43.2, 43.7, 44.0, 44.1, 44.0, 43.9, 43.5, 43.0],
    [ 43.9, 44.8, 45.6, 46.2, 46.7, 47.0, 47.1, 47.0, 46.8, 46.5, 45.9],
    [ 46.7, 47.7, 48.5, 49.1, 49.6, 49.9, 50.1, 50.0, 49.8, 49.4, 48.9],
    [ 49.5, 50.5, 51.4, 52.1, 52.6, 52.9, 53.1, 53.0, 52.8, 52.4, 51.8],
    [ 52.3, 53.4, 54.3, 55.0, 55.6, 55.9, 56.1, 56.0, 55.8, 55.3, 54.7],
    [ 55.0, 56.2, 57.1, 57.9, 58.5, 58.9, 59.1, 59.0, 58.8, 58.3, 57.6],
    [ 57.7, 59.0, 60.0, 60.8, 61.5, 61.9, 62.1, 62.0, 61.7, 61.2, 60.5]])

lonlat = np.dstack((lon,lat))

paths = [
    [(-1.71,34.4),(1.81,34.7),(5.15,38.0),(4.9,41.0),(4.63,44.0),(-0.03,46.7),(-4.33,46.2),(-9.6,48.5),(-8.55,45.6),(-3.58,43.2),(-2.91,40.3),(-2.29,37.3),(-1.71,34.4)],
    [(0.976,40.7),(-4.33,46.2),(-0.62,49.6),(3.99,49.9),(4.33,47.0),(4.63,44.0),(0.976,40.7)],
    [(2.9,55.8),(2.37,56.0),(8.47,56.1),(3.17,55.9),(-2.05,55.6),(-1.28,52.6),(-0.62,49.6),(4.33,47.0),(8.8,44.1),(2.29,44.0),(2.71,43.9),(3.18,46.5),(3.25,49.4),(3.33,52.4),(2.9,55.8)],
    [(2.25,35.1),(2.26,38.1),(8.86,41.1),(5.15,38.0),(5.38,35.0),(9.01,32.1),(2.25,35.1)]]

frequency = comp_frequency(paths,lonlat)

Of course this is about as inefficiently written as possible, with all the explicit loops, and accordingly takes forever.

How can I do this efficiently?

Edit: Added some sample data on request. Note that my real domain is 150**2 times larger (in terms of resolution), as I've created the sample coordinates by slicing the original arrays: lon[::150].

flotzilla
  • Could you add some of your data to your post? I mean paths and lonlat – LetzerWille Sep 29 '15 at 13:50
  • @LetzerWille: Ok, I've added some data (reduced coordinate arrays and some made-up contours). – flotzilla Sep 29 '15 at 14:33
  • Do you have the original grid that the contours were generated from? In this case, it's actually much more efficient to sample the grid and infer which contours it's inside than it is to calculate polygon intersections. – Joe Kington Sep 29 '15 at 14:54
  • @JoeKington: What exactly do you mean by "sample the grid and infer which contours it's inside"? Anyway, I guess by "original grid" you mean the original input field. Yes, I do have it, but it is not a single field, but many of those, to be exact one every hour for at least one year. From these I have extracted certain contours (maybe 4 on average per field), and this data set I am now post-processing. – flotzilla Sep 29 '15 at 16:38
  • @flotzilla - Either way, sampling the input grids should be faster than doing multiple complex polygon intersections. By sampling the values of the input grids at the given point, you can directly infer which contours the point would be inside. – Joe Kington Sep 29 '15 at 16:44
  • @JoeKington: If my current approach turns out to be unreasonably expensive, I could also do it on the fly in the original program (but even in that case, I'm interested in a solution to the problem). However, I still don't really get what you specifically mean by "sampling the values of the input grids at the given point", could you please quickly elaborate on this? (I'm also happy with e.g. a wikipedia link in case this "sampling" is a well-defined procedure I'm just not familiar with.) Thanks! – flotzilla Sep 29 '15 at 16:52
  • @flotzilla - By "sample" I mean "interpolate the value of the grid at the given point". – Joe Kington Sep 29 '15 at 16:58

3 Answers


If your input polygons are actually contours, then you're better off working directly with your input grids than calculating contours and testing if a point is inside them.

Contours follow a constant value of the gridded data: each contour is a polygon enclosing the areas of the input grid whose values are greater than that contour level.

If you need to know how many contours a given point is inside, it's faster to sample the input grid at the point's location and compare the returned "z" value to the contour levels. The number of contours the point is inside can be read off directly if you know what values you created the contours at.

For example:

import numpy as np
from scipy.interpolate import RegularGridInterpolator
import matplotlib.pyplot as plt

# One of your input gridded datasets
y, x = np.mgrid[-5:5:100j, -5:5:100j]
z = np.sin(np.hypot(x, y)) + np.hypot(x, y) / 10

contour_values = [-1, -0.5, 0, 0.5, 1, 1.5, 2]

# A point location...
x0, y0 = np.random.normal(0, 2, 2)

# Visualize what's happening...
fig, ax = plt.subplots()
cont = ax.contourf(x, y, z, contour_values, cmap='gist_earth')
ax.plot([x0], [y0], marker='o', ls='none', color='salmon', ms=12)
fig.colorbar(cont)

# Instead of working with whether or not the point intersects the
# contour polygons we generated, we'll turn the problem on its head:

# Sample the grid at the point location
interp = RegularGridInterpolator((y[:,0], x[0,:]), z)  # grid axes in (row, col) = (y, x) order
z0 = interp([y0, x0])

# How many contours would the point be inside?
num_inside = sum(z0 > c for c in contour_values)[0]

ax.set(title='Point is inside {} contours'.format(num_inside))
plt.show()

[Figure: filled contour plot of the example grid with the sample point marked; the title reports the number of contours the point lies inside.]

Joe Kington
  • Thanks for this alternative approach, this might come in handy once I do all the analysis online, which I will at some point in the future. It doesn't really help with the problem at hand, though. To do the flexible kind of analysis I'm currently doing, I'd have to save a full 2D field (binary would do it, but still) for basically every contour, and this is just not feasible given the amount of data. Then I'd rather only save the contour coordinates (as I'm doing) and wait a bit longer for the analysis to complete. – flotzilla Oct 01 '15 at 07:22
    @flotzilla - You wouldn't need to save it for every contour. Instead you're sampling the data the contours are generated from. – Joe Kington Oct 01 '15 at 13:06

So meanwhile I've found a good solution thanks to a colleague who's implemented something similar at some point (based on this blog post).

Old, very slow approach using shapely

First, here's my reference implementation using shapely, which is just a somewhat refined version of my first "naive" approach in the opening post. It is easy to understand and works, but is really slow.

import numpy as np
from shapely.geometry import Polygon, Point

def spatial_contour_frequency_shapely(paths,lon,lat):

    frequency = np.zeros(lon.shape)
    contours = [Polygon(path) for path in paths]

    for (i,j),v in np.ndenumerate(frequency):
        pt = Point([lon[i,j],lat[i,j]])
        for contour in contours:
            if contour.contains(pt):
                frequency[i,j] += 1

    return frequency

New, very fast solution using PIL

My (almost-) final solution doesn't use shapely anymore, but instead uses image manipulation techniques from PIL (Python Imaging Library). This solution is much, much, much faster, although in this form only for regular, rectangular grids (see the notes on handling the irregular grid below).

import numpy as np
import scipy as sp
import scipy.spatial  # makes sp.spatial.cKDTree available below
from PIL import Image, ImageDraw

def _spatial_contour_frequency_pil(paths,lon,lat,regular_grid=False,
        method_ind=None):

    def get_indices(points,lon,lat,tree=None,regular=False):

        def get_indices_regular(points,lon,lat):
            lon,lat = lon.T,lat.T
            def _get_ij(lon,lat,x,y):
                lon0 = lon[0,0]
                lat0 = lat[0,0]
                lon1 = lon[-1,-1]
                lat1 = lat[-1,-1]
                nx,ny = lon.shape
                dx = (lon1-lon0)/nx
                dy = (lat1-lat0)/ny
                i = int((x-lon0)/dx)
                j = int((y-lat0)/dy)
                return i,j
            return [_get_ij(lon,lat,x,y) for x,y in points]

        def get_indices_irregular(points,tree,shape):

            dist,idx = tree.query(points,k=1)
            ind = np.column_stack(np.unravel_index(idx,lon.shape))
            return [(i,j) for i,j in ind]

        if regular:
            return get_indices_regular(points,lon,lat)
        return get_indices_irregular(points,tree,lon.T.shape)

    tree = None
    if not regular_grid:
        lonlat = np.column_stack((lon.T.ravel(),lat.T.ravel()))
        tree = sp.spatial.cKDTree(lonlat)

    frequency = np.zeros(lon.shape)
    for path in paths:
        path_ij = get_indices(path,lon,lat,tree=tree,regular=regular_grid)
        raster_poly = Image.new("L",lon.shape,0)
        rasterize = ImageDraw.Draw(raster_poly)
        rasterize.polygon(path_ij,1)
        # Convert the rasterized polygon into a 0/1 numpy mask per grid cell
        mask = np.frombuffer(raster_poly.tobytes(), dtype=np.int8)
        mask = mask.reshape(raster_poly.size[1], raster_poly.size[0])
        frequency += mask

    return frequency
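
To make the call signature concrete, here is a minimal usage sketch with the sample lon, lat and paths arrays from the question (my own addition, not part of the original timing setup; the timings below were obtained on a different, reduced data set):

# Sketch: apply the PIL-based version to the sample data from the question.
# regular_grid=False uses the cKDTree lookup, which also handles the
# irregular lon/lat grid; regular_grid=True takes the fast shortcut.
frequency_pil = _spatial_contour_frequency_pil(paths, lon, lat, regular_grid=False)
print(frequency_pil.astype(int))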

It should be noted that the results of these two approaches are not exactly the same. The features identified with the PIL approach are slightly larger than those identified with the shapely approach, but really neither is better than the other.

Timings

Here are some timings created with a reduced data set (not the semi-artificial example data from the opening post, though):

spatial_contour_frequency/shapely             :   191.8843
spatial_contour_frequency/pil                 :     0.3287
spatial_contour_frequency/pil-tree-inside     :     2.3629
spatial_contour_frequency/pil-regular_grid    :     0.3276

The most time-consuming step is finding the indices of the contour points on the irregular lon/lat grid. The most time-consuming part thereof is the construction of the cKDTree, which is why I've moved it out of get_indices. To put this into perspective, pil-tree-inside is the version where the tree is created inside get_indices. pil-regular_grid is with regular_grid=True, which, for my data set, yields wrong results, but gives an idea of how fast this would run on a regular grid.
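
For completeness, a minimal sketch of how such timings can be collected with the standard timeit module (this is not the harness that produced the numbers above; it assumes the two functions and a paths/lon/lat data set are already defined in scope, and uses a single repetition purely for illustration):

import timeit

# Hypothetical timing harness; one repetition per candidate, for illustration only.
candidates = {
    "shapely": lambda: spatial_contour_frequency_shapely(paths, lon, lat),
    "pil":     lambda: _spatial_contour_frequency_pil(paths, lon, lat),
}
for name, func in candidates.items():
    elapsed = timeit.timeit(func, number=1)
    print("spatial_contour_frequency/{:<10s}: {:10.4f}".format(name, elapsed))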

Overall now I've managed to pretty much eliminate the effect of the non-regular grid (pil vs. pil-regular-grid), which is all I could hope for in the beginning! :)

flotzilla
import numpy as np
import operator
from shapely.geometry import Polygon, Point

def comp_frequency_lc(paths,lonlat):

    frequency = np.zeros(lonlat.shape[:2])
    contours = [Polygon(path) for path in paths]

    for (i,j),v in np.ndenumerate(frequency):
        pt = Point(lonlat[i,j,:])
        [
            operator.setitem(frequency,(i,j),
                    operator.getitem(frequency,(i,j))+1)
            for contour in contours if contour.contains(pt)
        ]

    return frequency

print(comp_frequency_lc(paths,lonlat))

**results in**:

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  1.  2.  2.  2.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  1.  1.]
 [ 0.  2.  0.  0.  2.  0.  0.  2.  2.  2.  1.]
 [ 0.  2.  0.  0.  1.  0.  0.  1.  1.  1.  2.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  2.  1.  1.]
 [ 0.  1.  1.  0.  0.  0.  0.  1.  1.  0.  0.]
 [ 0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

Here are the results from timing, where comp_frequency_lc is the version that uses the list comprehension:

# 10000 runs
# comp_frequency    185.10419257196094
# comp_frequency_lc 124.42639064462102
LetzerWille
  • Thanks, I'll try it out tomorrow. But I kind of doubt that this will have a huge effect, given that the explicit loops have been retained in principle... I guess I'll need a much more pure numpy/shapely solution. And I really do need a huge speedup, because with the original amount of data my initial "solution" might take hours (I've aborted it after about 20 min). – flotzilla Sep 29 '15 at 16:46
  • I've now started comparing approaches (mostly to test a better version of my initial one) and have also tested yours. However, I don't get any speedup at all, but rather a small slowdown, namely from 74.0 s to 74.3 s for 10000 repetitions! I've also tried to fold the loop over "frequency" into the comprehension as well, but then I get an even bigger slowdown, to 154 s. Looks like I can safely drop this approach. (Thanks anyway for the suggestion, of course!) – flotzilla Sep 30 '15 at 10:24