
I have some 2D data that I am smoothing using:

from scipy.stats import gaussian_kde
kde = gaussian_kde(data)

but what if my data isn't Gaussian/tophat/the other options? Mine looks more elliptical before smoothing, so should I really have a different bandwidth in x and then y? The variance in one direction is a lot higher, and also the values of the x axis are higher, so it feels like a simple Gaussian might miss something?

  • Please provide a sample of your data. Gaussian kernel density estimation of the probability density function https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html is OK for unimodal distributions; as you can read in the docs, bi-/multi-modal distributions are oversmoothed. – Max Pierini Mar 30 '21 at 12:32
  • PS: if you're not satisfied with `scipy`'s Gaussian KDE, you can try `KDEpy` https://kdepy.readthedocs.io/en/latest/examples.html, which lets you choose different kernel functions – Max Pierini Mar 30 '21 at 12:36
  • Data would be something like: `X = np.random.normal(size=(100,1), loc=1, scale = 0.01), Y = np.random.normal(size=(100,1), loc=200, scale =100)` i.e. a much bigger magnitude in Y and a much wider spread, so I wouldn't want a Gaussian KDE, more elliptical? – Lizardinablizzard Mar 31 '21 at 11:52
  • I don't know what you mean exactly; I'll send an answer with your X and Y – Max Pierini Mar 31 '21 at 12:47

1 Answer


This is what I get with your defined X and Y. Seems good. Were you expecting something different?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def generate(n):
    # generate data
    np.random.seed(42)
    x = np.random.normal(size=n, loc=1, scale=0.01)
    np.random.seed(1)
    y = np.random.normal(size=n, loc=200, scale=100)
    return x, y

x, y = generate(100)
xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

fig, ax = plt.subplots(figsize=(7, 7))
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
          extent=[xmin, xmax, ymin, ymax],
          aspect='auto', alpha=.75
         )
ax.plot(x, y, 'ko', ms=5)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.show()

(figure: 2D Gaussian KDE heatmap with the data points overlaid)
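As a side note on the bandwidth question: `gaussian_kde` also takes a `bw_method` argument ('scott' by default, 'silverman', a scalar, or a callable) that scales the overall smoothing. A minimal sketch, using synthetic data shaped like the comment's example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(1, 0.01, 100),
                  rng.normal(200, 100, 100)])

# Default is Scott's rule: factor = n ** (-1 / (d + 4))
kde_scott = stats.gaussian_kde(data)
# A scalar bw_method is used directly as the factor
kde_tight = stats.gaussian_kde(data, bw_method=0.2)

print(kde_scott.factor)  # ~0.464 for n=100, d=2
print(kde_tight.factor)  # 0.2
```

A smaller factor means less smoothing in every direction at once; the per-axis widths still come from the data itself, as shown below.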

The distributions of x and y are Gaussian. You can verify this with seaborn too:

import pandas as pd
import seaborn as sns
# I pass a DataFrame because passing
# (x, y) alone will soon be deprecated
g = sns.jointplot(data=pd.DataFrame({'x':x, 'y':y}), x='x', y='y')
g.plot_joint(sns.kdeplot, color="r", zorder=0, levels=6)

(figure: seaborn jointplot with red KDE contour levels over the scatter)
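To address the original worry about different bandwidths in x and y directly: `gaussian_kde` scales its kernel by the covariance of the data, so each axis effectively gets its own smoothing width and no prior normalisation is needed. A quick check (my own sketch, using the same kind of synthetic data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(1, 0.01, 100)
y = rng.normal(200, 100, 100)

kde = stats.gaussian_kde(np.vstack([x, y]))

# Kernel covariance = (data covariance) * factor**2, so the
# per-axis smoothing widths track each axis's own spread
widths = np.sqrt(np.diag(kde.covariance))
print(widths)  # roughly 0.46 * [0.01, 100]
```

The y-direction width comes out about four orders of magnitude larger than the x-direction one, matching the asker's intuition of "sigma about 0.001 in x, about 1 in y" scaled by the same factor.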


update

As a simplified picture, a kernel density estimate of 2-dimensional data can be computed separately along each axis and the results then joined together.

Let's make an example with the dataset we already used.

As we can see in the seaborn jointplot, you get not only the estimated 2D KDE but also the marginal distributions of x and y (the histograms).

(figure: seaborn jointplot again, highlighting the marginal histograms)

So, step by step, let's estimate the density of x and of y and then evaluate each density over a linear space:

kde_x = stats.gaussian_kde(x)
kde_x_space = np.linspace(x.min(), x.max(), 100)
kde_x_eval = kde_x.evaluate(kde_x_space)
kde_x_eval /= kde_x_eval.sum()

kde_y = stats.gaussian_kde(y)
kde_y_space = np.linspace(y.min(), y.max(), 100)
kde_y_eval = kde_y.evaluate(kde_y_space)
kde_y_eval /= kde_y_eval.sum()

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(kde_x_space, kde_x_eval, 'k.')
ax[0].set(title='KDE of x')
ax[1].plot(kde_y_space, kde_y_eval, 'k.')
ax[1].set(title='KDE of y')
plt.show()

(figure: side-by-side plots titled "KDE of x" and "KDE of y")
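As a sanity check (my addition, assuming the same setup): `evaluate` returns a true probability density, so before the sum-normalisation above, integrating a marginal over its grid should give a value close to 1 (slightly less, because the grid clips the kernel tails):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
x = rng.normal(1, 0.01, 100)

kde_x = stats.gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 100)
dens = kde_x.evaluate(grid)

# A bit below 1: the kernel tails extend past [x.min(), x.max()]
area = trapezoid(dens, grid)
print(area)
```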

So we now have the marginal distributions of x and y. These are probability density functions, so (treating x and y as independent) the joint density is just the product of the marginals: we can multiply the estimated densities of x and y into a 2D matrix and plot it as a 3D projection.

# Grid of x and y
X, Y = np.meshgrid(kde_x_space, kde_y_space)
# Grid of probability density
kX, kY = np.meshgrid(kde_x_eval, kde_y_eval)
# Intersection
Z = kX * kY

fig, ax = plt.subplots(
    2, 2, 
    subplot_kw={"projection": "3d"}, 
    figsize=(10, 10))

for i, (elev, azim, title) in enumerate(zip([10, 10, 25, 25],
                                            [0, -90, 25, -25],
                                            ['y axis', 'x axis', 'view 1', 'view 2']
                                            )):
    # Plot the surface.
    surf = ax.flat[i].plot_surface(X, Y, Z, cmap=plt.cm.gist_earth_r,
                           linewidth=0, antialiased=False, alpha=.75)
    ax.flat[i].scatter(x, y, zs=0, zdir='z', c='k')
    ax.flat[i].set(
        xlabel='x', ylabel='y',
        title=title
    )
    ax.flat[i].view_init(elev=elev, azim=azim)
plt.show()

(figure: four 3D surface views of the product density over the data)

This is a very simple and naive method, meant only to give an idea of how it works and why the x and y scales don't matter for a 2D KDE.
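In fact, because `gaussian_kde`'s kernel covariance is the data covariance scaled by `factor**2`, standardising the data yourself first and transforming the density back gives the same answer, so no manual normalisation is needed. A sketch of that equivalence (my own, on synthetic data like the comment's example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = np.vstack([rng.normal(1, 0.01, 200),
                  rng.normal(200, 100, 200)])

kde = stats.gaussian_kde(data)

# Standardise each axis, fit, then map the density back:
# if u = (v - mu) / s per axis, then p_v(v) = p_u(u) / (s_x * s_y)
mu = data.mean(axis=1, keepdims=True)
s = data.std(axis=1, ddof=1, keepdims=True)
kde_std = stats.gaussian_kde((data - mu) / s)

pts = data[:, :5]
direct = kde(pts)
via_std = kde_std((pts - mu) / s) / s.prod()
print(np.allclose(direct, via_std))  # the two densities agree
```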

Max Pierini
  • Yes, that does look good, thank you, but I'm not sure I understand how! If the KDE puts a Gaussian distribution of width sigma on every data point and then adds them up: looking at the x-axis I'd say sigma was about 0.001, but looking at the y-axis I'd say it's about 1. I'm just confused about what approximate value of sigma it must be using. The result does look good; I just wanted to be sure that I didn't need to normalise my data first or something – Lizardinablizzard Mar 31 '21 at 16:36
  • You can find a very good interactive explanation of how KDE works and how different kernel functions affect the estimate here: https://mathisonian.github.io/kde/ – Max Pierini Mar 31 '21 at 16:58
  • @Lizardinablizzard I added an update about the basics of 2d-KDE explaining why x and y scales don't matter – Max Pierini Apr 01 '21 at 08:51