I have some experimental data that looks like this:
x = array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1, ...])
y = array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75, ...])
z = array([10, 4, 1, 4, 5, 0, 1, ...])
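(If a runnable stand-in is useful: something like the following generates data with a similarly uneven distribution. The numbers are purely illustrative random values, not my real measurements.)
import numpy as np
import pandas as pd

# Synthetic stand-in: ~20k samples, deliberately denser in some regions than others
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(1, 5, n) ** 2 / 5     # denser at small x, sparser at large x
y = -10 * rng.uniform(0, 1, n) ** 2   # denser near y = 0, sparser near y = -10
z = rng.uniform(0, 10, n)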
If it's convenient, we can assume that the data exists as a single (N, 3) array or even a pandas DataFrame:
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
The interpretation being, for every position x[i], y[i], the value of some variable is z[i]. These are not evenly sampled, so there will be some parts that are "densely sampled" (e.g. between 1 and 1.2 in x) and others that are very sparse (e.g. between 2 and 3 in x). Because of this, I can't just chuck these into a pcolormesh or contourf.
What I would like to do instead is to resample x and y evenly at some fixed interval and then aggregate the values of z. For my needs, z can be summed or averaged to get meaningful values, so this is not a problem. My naïve attempt was like this:
# Regular grid at a fixed 0.1 spacing in both directions
X = np.arange(min(x), max(x), 0.1)
Y = np.arange(min(y), max(y), 0.1)
x_g, y_g = np.meshgrid(X, Y)

nx, ny = x_g.shape
z_g = np.full(x_g.shape, np.nan)

# For every grid cell, select the samples that fall inside it and aggregate z
for ix in range(nx - 1):
    for jx in range(ny - 1):
        x_min = x_g[ix, jx]
        x_max = x_g[ix + 1, jx + 1]
        y_min = y_g[ix, jx]
        y_max = y_g[ix + 1, jx + 1]
        vals = df[(df.x >= x_min) & (df.x < x_max) &
                  (df.y >= y_min) & (df.y < y_max)].z.values
        if vals.size:  # not vals.any(), which would skip cells whose z values are all zero
            z_g[ix, jx] = vals.sum()
This works and gives the output I want when plotted with plt.contourf(x_g, y_g, z_g), but it is SLOW! I have ~20k samples, which I then bin onto a grid of ~800 points in x and ~500 in y, so the double loop runs ~400,000 times.
Is there any way to vectorize/optimize this? Even better if there is some function that already does this!
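To make the target concrete, here is a sketch of the kind of one-shot binned aggregation I'm hoping exists. I believe scipy.stats.binned_statistic_2d does something along these lines, though I'm not certain it matches my use case exactly; the 0.1 spacing and the df/x/y/z names are just carried over from above:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

# Regular bin edges at the same 0.1 spacing used above
x_edges = np.arange(x.min(), x.max() + 0.1, 0.1)
y_edges = np.arange(y.min(), y.max() + 0.1, 0.1)

# Aggregate z over every cell of the regular grid in one call
# (statistic='mean' would work the same way if averaging is preferred)
stat, x_edge, y_edge, _ = binned_statistic_2d(
    df.x, df.y, df.z, statistic='sum', bins=[x_edges, y_edges])

# Mask empty cells (statistic='sum' returns 0 there) so they behave like the NaNs above
counts, _, _, _ = binned_statistic_2d(
    df.x, df.y, df.z, statistic='count', bins=[x_edges, y_edges])
stat[counts == 0] = np.nan

# The returned statistic has x along the first axis, so transpose before plotting
x_centers = 0.5 * (x_edge[:-1] + x_edge[1:])
y_centers = 0.5 * (y_edge[:-1] + y_edge[1:])
plt.contourf(x_centers, y_centers, stat.T)
plt.show()
If SciPy isn't an option, I think np.histogram2d(x, y, bins=[x_edges, y_edges], weights=z)[0] would give the same summed grid in plain numpy.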
(Also tagging this as MATLAB, because numpy and MATLAB syntax are very similar and I have access to both.)