
I have been using kernel density estimation for a while, but so far I have always taken the easy way out by analysing only normalised distributions, where comparisons between different data sets were not necessary. In my current project I want to compare 2D density distributions on an absolute scale, and it seems I have missed a critical point about how KDE works. I need to compare stellar densities on the sky from two different data sets, so I need either absolute numbers (in stars per unit area) or a way to compare the two calculated density estimates directly. To illustrate my problem, have a look at this code:

# Import stuff
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.ticker import MultipleLocator

# Define kernel
kernel = KernelDensity(kernel="gaussian", bandwidth=1)

# Set some parameters for the synthetic data
mean = [0, 0]
cov = [[0.2, 0.1], [0.1, 1]]  # must be symmetric positive semi-definite

# Create two data sets with different densities
x1, y1 = np.random.multivariate_normal(mean,cov,100).T
x2, y2 = np.random.multivariate_normal(mean,cov,1000).T

# Create grid
xgrid = np.arange(-5, 5, 0.1)
ygrid = np.arange(-5, 5, 0.1)
xy_coo = np.meshgrid(xgrid, ygrid)
grid = np.array([xy_coo[0].reshape(-1), xy_coo[1].reshape(-1)])

# Prepare data
data1 = np.vstack([x1, y1])
data2 = np.vstack([x2, y2])

# Evaluate the density of each sample on the grid
# (score_samples returns the log-density)
log_dens1 = kernel.fit(data1.T).score_samples(grid.T)
dens1 = np.exp(log_dens1).reshape(len(ygrid), len(xgrid))
log_dens2 = kernel.fit(data2.T).score_samples(grid.T)
dens2 = np.exp(log_dens2).reshape(len(ygrid), len(xgrid))

# Plot the distributions and densities
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

im1 = ax1.imshow(dens1, extent=[-5, 5, -5, 5], origin="lower", vmin=0, vmax=0.1)
ax1.scatter(x1, y1, s=1, marker=".")
divider1 = make_axes_locatable(ax1)
cax1 = divider1.append_axes("top", size="10%", pad=0.4)
cbar1 = plt.colorbar(im1, cax=cax1, orientation="horizontal", ticks=MultipleLocator(0.02), format="%.2f")

im2 = ax2.imshow(dens2, extent=[-5, 5, -5, 5], origin="lower", vmin=0, vmax=0.1)
ax2.scatter(x2, y2, s=1, marker=".")
divider2 = make_axes_locatable(ax2)
cax2 = divider2.append_axes("top", size="10%", pad=0.4)
cbar2 = plt.colorbar(im2, cax=cax2, orientation="horizontal", ticks=MultipleLocator(0.02), format="%.2f")

plt.show()

The code generates two data sets: one with 100 sources and one with 1000. Their distributions are shown as scattered points in the plots. The code then evaluates the kernel density on a given grid, which is shown as the coloured background of each panel. What puzzles me is that the densities I get (the values on the colorbar) are almost the same for both distributions, even though the second set has 10 times more sources. This makes it impossible to compare the two density distributions directly to each other.
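A stripped-down check (my own synthetic data, not the exact arrays from the question) makes the observation concrete: the grid sum of the estimated density times the cell area is approximately 1 for both a 100-point and a 1000-point sample, so the sample size cannot show up in the colour scale.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Two samples of very different size from the same distribution
small = rng.normal(size=(100, 2))
large = rng.normal(size=(1000, 2))

# Evaluate both KDEs on the same grid (cell area = 0.1 * 0.1)
xgrid = np.arange(-5, 5, 0.1)
xx, yy = np.meshgrid(xgrid, xgrid)
grid = np.column_stack([xx.ravel(), yy.ravel()])

for sample in (small, large):
    kde = KernelDensity(kernel="gaussian", bandwidth=1).fit(sample)
    dens = np.exp(kde.score_samples(grid))
    # Riemann sum over the grid: density values times cell area
    print(dens.sum() * 0.1 * 0.1)  # ~1.0 for both samples
```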

My questions:

a) How exactly are the densities normalised? By number counts?

b) Is there any way to get an absolute density estimate from the KDE? Say, sources per 1x1 box in these arbitrary units?

thanks

HansSnah

1 Answer


KDE is a non-parametric estimate of the probability density function, so the estimated density integrates to 1 over the whole domain regardless of how many points went into it. You can think of it as a smoothed histogram normalised by the number of observations.

So, to get absolute numbers, you just need to multiply the density by the number of observations: N times the density is then an expected number of sources per unit area, and integrating that over a region gives the expected number of sources in that region.
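A minimal sketch of that scaling (the synthetic data and variable names here are mine, not from the question): N times the KDE output is integrated over a 1x1 box with a midpoint Riemann sum, which should land close to the actual number of sample points inside the box when the bandwidth is small relative to the data scale.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
sample = rng.normal(size=(1000, 2))  # 1000 "sources", std 1 per axis
n = len(sample)

# A bandwidth well below the data scale keeps the smoothing bias small
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(sample)

# Expected number of sources in the 1x1 box [-0.5, 0.5]^2:
# integrate n * density over the box with a midpoint Riemann sum.
step = 0.05
g = np.arange(-0.5, 0.5, step) + step / 2
xx, yy = np.meshgrid(g, g)
box = np.column_stack([xx.ravel(), yy.ravel()])
expected = n * np.exp(kde.score_samples(box)).sum() * step**2

# Actual number of sample points inside the box, for comparison
actual = np.sum(np.all(np.abs(sample) < 0.5, axis=1))
print(expected, actual)
```

The two printed numbers should agree to within the smoothing bias and kernel mass leaking across the box edges; with a larger bandwidth (like the bandwidth of 1 in the question) the expected count is noticeably flattened.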

Jianxun Li