How to speed up calculation of data density in python (following Matlab datadensity.m)

Question

In order to create a data density image I am follwing the calculations proposed in the Matlab code datadensity.m. It seemded a lot more straightforward than any [python codes][1] I found. However, it takes an incredible long amount of time to calculate the datapoints. Is there any way to speed things up? Is there a more efficient method using python syntax and/or speeding up the for-loops? My x and y data have many thousands of datapoints.

Here is my code:

# create random data
df_density = pd.DataFrame(np.random.randn(100000, 2), columns=list('xy'))

# width, height - dimensions of the density plot
width = 256
height = 256
# minimum and maximum of the input data
limits_1 = min(df_density.x)
limits_2 = max(df_density.x)
limits_3 = min(df_density.y)
limits_4 = max(df_density.y)
# resolution
deltax = (limits_2 - limits_1) / width
deltay = (limits_4 - limits_3) / height
# amount of smear, defaults to size of pixel diagonal
fudge = math.sqrt(deltax**2 + deltay**2)

dmap = np.zeros((height, width))
for ii in range(height-1):
    yi = limits_3 + ii * deltay + deltay/2
    for jj in range(width-1):
        xi = limits_1 + jj * deltax + deltax/2
        dd = 0
        for kk in range(len(df_density)):
            dist2 = (df_density.x[kk] - xi)**2 + (df_density.y[kk] - yi)**2
            dd = dd + 1 / (dist2 + fudge)               
        dmap[ii,jj] = dd

[1]:e.g. Efficient method of calculating density of irregularly spaced points

score 1 · Answer 1 · answered Mar 04 '15 at 15:55

First, you should use range(width) and range(height), not range(width-1) and range(height-1). This is because Matlab includes the last element of a range, while python does not.

As for performance, there are a bunch of things you can do.

First, don't use the python builtin min and max functions. Since you are using pandas, use the pandas version:

limits_1 = min(df_density.x)
limits_2 = max(df_density.x)
limits_3 = min(df_density.y)
limits_4 = max(df_density.y)
# resolution
deltax = (limits_2 - limits_1) / width
deltay = (limits_4 - limits_3) / height
# amount of smear, defaults to size of pixel diagonal
fudge = np.sqrt(deltax**2 + deltay**2)

execution time: 34.5 ms

# minimum and maximum of the input data
mins = df_density.min()
maxs = df_density.max()
# resolution
deltas = maxs-mins
deltax = deltas.x/width
deltay = deltas.y/height
# amount of smear, defaults to size of pixel diagonal
fudge = math.sqrt(deltax**2 + deltay**2)

execution time: 1.96 ms

You can also vectorize the computations:

for ii in range(height-1):
    yi = limits_3 + ii * deltay + deltay/2
    for jj in range(width-1):
        xi = limits_1 + jj * deltax + deltax/2

execution time: 47.1 ms

%%timeit
yis = limits_3 + np.arange(height-1)*deltay + deltay/2
xis = limits_1 + np.arange(width-1)*deltax + deltax/2

execution time: 20.3 µs (that is a factor of more than 2000 speedup).

And make it even faster by making sure most computations happen on scalars:

%%timeit
yis = limits_3+deltay/2+deltay*np.arange(height-1)
xis = limits_1+deltax/2+deltax*np.arange(height-1)

execution time: 14.2 µs

However, the final loop is going to be slow no matter what you do, since you will likely run out of memory if you try to vectorize the whole thing. It can be partially vectorized, which speeds things up considerably (although it still takes a few minutes to run):

# create random data
df_density = pd.DataFrame(np.random.randn(100000, 2), columns=list('xy'))

# width, height - dimensions of the density plot
width = 256
height = 256

# minimum and maximum of the input data
df_max = df_density.max()
df_min = df_density.min()
x_min = df_min.x
y_min = df_min.y

# resolution
deltas = df_max-df_min
deltax = deltas.x/width
deltay = deltas.y/height
# amount of smear, defaults to size of pixel diagonal
fudge = np.sqrt(deltax**2 + deltay**2)

dmap = np.zeros((height, width))
yis = y_min+deltay/2+deltay*np.arange(height)
xis = x_min+deltax/2+deltax*np.arange(width)

yiss, xiss = np.meshgrid(xis, yis)
for x, y in df_density.values:
    dmap+=1./(fudge+(x-xiss)**2+(y-yiss)**2)

How to speed up calculation of data density in python (following Matlab datadensity.m)

1 Answers1