
I created some simple seaborn kde plots and wonder whether the result is a bug.

My code is:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.kdeplot(np.array([1,2]), cmap="Reds",  shade=True,  bw=0.01)
sns.kdeplot(np.array([2.4,2.5]), cmap="Blues", shade=True,  bw=0.01)
plt.show()

The blue and red curves each show the kde of 2 points. If the points are close together, the densities are much narrower than when the points are further apart. I find this very counterintuitive, at least to the extent seen here. I am wondering whether this might be a bug. I also could not find a resource describing how the densities are computed from a given set of points. Any help is appreciated.

Plot shows the result of the above code

Trenton McKinney
  • Dear JohanC, thanks for your reply! My problem is: why do 2 well-separated points yield two broad bumps, while 2 points close together yield two narrow peaks? My understanding of the method was: each point is represented by a gaussian distribution (bump), and all these contributions add up to the final probability density. But then I would expect that the two closer points do not get a much narrower distribution than the two more distant points. – Eddy-Python Mar 18 '22 at 09:55
  • For both lines I specified the same bw factor, so why does the effect depend on the distance? – Eddy-Python Mar 18 '22 at 10:02
  • Many thanks! This creates the output that I expected from the very beginning. Somehow I am still surprised that the data is scaled. I would not have expected this from my understanding of a kde. I thought every data point simply gets a gaussian assigned and the sum of all gaussians is the result, without some hidden scaling of the bw. I wonder what the reason is that the bw is scaled. Are there kde libraries that do not scale the bw? Many thanks again. – Eddy-Python Mar 19 '22 at 13:54
  • I understand that in principle bw_method=0.01/np.std(data1) solves the problem, with one exception. I want to plot a kde for data that in some cases can consist of a single point. Then I would expect simply one gaussian bump. But I get an error message that the std cannot be computed. Well, this is consistent with the behaviour of scipy.stats.gaussian_kde, which needs the std to compute the bw. To avoid this problem, I would be interested in a kde plotter that does not need the std in the first place. Then I could also use it to plot a single bump for a single data point. – Eddy-Python Mar 19 '22 at 14:36
  • [kde] as a tag refers to the desktop environment, not to kernel density estimation. – gboffi Mar 21 '22 at 14:43

1 Answer

The bw_method= parameter (called bw= in older versions) is passed directly to scipy.stats.gaussian_kde. The docs there state: "If a scalar, this will be used directly as kde.factor". The description of kde.factor says: "The square of kde.factor multiplies the covariance matrix of the data in the kde estimation." So it is a scaling factor relative to the spread of the data, not a bandwidth in data units: each gaussian gets a standard deviation of kde.factor times the standard deviation of the data, which is why two points that are close together get much narrower bumps. If more details are needed, you could dive into scipy's source code, or into the research papers referenced in the docs.
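
To see the scaling concretely, the sketch below builds scipy.stats.gaussian_kde objects for the two arrays from the question and reads back the width of the kernel that is actually used. The helper name kernel_std is made up for this illustration; covariance is an attribute of scipy's gaussian_kde. It assumes one-dimensional data with at least two distinct values.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kernel_std(data, bw):
    """Standard deviation of each gaussian bump for a scalar bw_method.

    Hypothetical helper for illustration; assumes 1-D data with at least
    two distinct values.
    """
    kde = gaussian_kde(data, bw_method=bw)
    # kde.covariance is the data covariance multiplied by kde.factor ** 2
    return float(np.sqrt(kde.covariance[0, 0]))


far = np.array([1.0, 2.0])
near = np.array([2.4, 2.5])

for name, data in [("far", far), ("near", near)]:
    # second number: bw times the sample standard deviation (ddof=1)
    print(name, kernel_std(data, 0.01), 0.01 * np.std(data, ddof=1))
```

For each array the two printed numbers should agree, and the bumps for the points 0.1 apart should come out ten times narrower than for the points 1 apart, which matches the plot in the question.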

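If you really want to counter the scaling, you could divide it away: sns.kdeplot(np.array(data), ..., bw_method=0.01/np.std(data, ddof=1)). The ddof=1 matches the sample covariance that scipy uses; with numpy's default ddof=0 the bumps come out somewhat wider than requested (by a factor sqrt(n/(n-1)) for n points). Note that this needs at least two distinct data points.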
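
A minimal check of this idea, without seaborn: the hypothetical helper factor_for_sigma converts a desired bandwidth in data units into the factor that scipy.stats.gaussian_kde expects. It divides by the standard deviation with ddof=1, on the assumption (true for the scipy versions I know) that the data covariance is computed that way. The result is compared against a direct average of gaussians; the names manual_kde, grid and sigma are only illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde


def factor_for_sigma(data, sigma):
    """bw_method factor that gives kernels of standard deviation `sigma`.

    Hypothetical helper; needs at least two distinct values, because the
    standard deviation is undefined (or zero) otherwise.
    """
    return sigma / np.std(data, ddof=1)


def manual_kde(grid, data, sigma):
    """Average of gaussians of width `sigma` centred on each data point."""
    z = (grid[:, None] - data[None, :]) / sigma
    return (np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))).mean(axis=1)


sigma = 0.03
grid = np.linspace(0, 4, 401)
results = {}
for data in (np.array([1.0, 2.0]), np.array([2.4, 2.5])):
    scipy_kde = gaussian_kde(data, bw_method=factor_for_sigma(data, sigma))
    # True when scipy's curve matches the fixed-width sum of gaussians
    results[tuple(data)] = bool(np.allclose(scipy_kde(grid), manual_kde(grid, data, sigma)))
print(results)
```

The same factor can be passed as bw_method= to sns.kdeplot, as long as bw_adjust is left at its default of 1.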

Or you could create your own version of a gaussian kde, with a bandwidth in data coordinates. It just sums one gaussian curve per data point and normalizes (the total area under the curve should be 1) by dividing by the number of curves.

Here is some example code, with kde curves for 1, 2 or 20 input points:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

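def gauss(x, mu=0.0, sigma=1.0):
    # normal probability density with mean mu and standard deviation sigma
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def kde(xs, data, sigma=1.0):
    # broadcast to shape (len(xs), len(data)): one gaussian per data point,
    # then average over the data points so the total area stays 1
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)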

sns.set()
sigma = 0.03
xs = np.linspace(0, 4, 300)
fig, ax = plt.subplots(figsize=(12, 5))

data1 = np.array([1, 2])
kde1 = kde(xs, data1, sigma=sigma)
ax.plot(xs, kde1, color='crimson', label=f'dist of 1, σ={sigma}')
ax.fill_between(xs, kde1, color='crimson', alpha=0.3)

data2 = np.array([2.4, 2.5])
kde2 = kde(xs, data2, sigma=sigma)
ax.plot(xs, kde2, color='dodgerblue', label=f'dist of 0.1, σ={sigma}')
ax.fill_between(xs, kde2, color='dodgerblue', alpha=0.3)

data3 = np.array([3])
kde3 = kde(xs, data3, sigma=sigma)
ax.plot(xs, kde3, color='limegreen', label=f'1 point, σ={sigma}')
ax.fill_between(xs, kde3, color='limegreen', alpha=0.3)

data4 = np.random.normal(0.01, 0.1, 20).cumsum() + 1.1
kde4 = kde(xs, data4, sigma=sigma)
ax.plot(xs, kde4, color='purple', label=f'20 points, σ={sigma}')
ax.fill_between(xs, kde4, color='purple', alpha=0.3)

ax.margins(x=0)  # remove superfluous whitespace left and right
ax.set_ylim(ymin=0)  # let the plot "sit" onto y=0
ax.legend()
plt.show()

kde curves with bandwidth in data coordinates
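
As a quick sanity check on the normalization, the sketch below repeats the gauss and kde functions from the code above and integrates each curve numerically with a hand-written trapezoidal rule. The grid resolution and the helper name area_under are choices made for this example only.

```python
import numpy as np


def gauss(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))


def kde(xs, data, sigma=1.0):
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)


def area_under(xs, ys):
    """Trapezoidal rule, written out so it does not depend on the NumPy version."""
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))


sigma = 0.03
xs = np.linspace(0, 4, 4001)  # fine grid so the narrow bumps are well resolved

areas = [area_under(xs, kde(xs, data, sigma))
         for data in (np.array([1, 2]), np.array([2.4, 2.5]), np.array([3]))]
print(areas)
```

Each area should come out very close to 1, including the single-point case, which needs no standard deviation at all.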

JohanC