Seaborn KDEPlot - not enough variation in data?

Question

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)

But including 451 of the minimum values gives a very different output:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)

Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.

JohanC · Accepted Answer · 2020-12-28T14:55:48.420

The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.

The bandwidth is the width of the Gaussians that are positioned at every sample point and summed up. Lower bandwidths follow the data more closely; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. To compare different KDEs, it is recommended to use exactly the same bandwidth for each of them.
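To make the "sum of Gaussians" description concrete, here is a minimal NumPy sketch of a univariate Gaussian KDE. The function name `gaussian_kde_manual` and the sample data are hypothetical, chosen only for illustration. In this sketch `bandwidth` is the standard deviation of each kernel in data units, which may differ from how a given library scales a numeric bandwidth argument.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per sample point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Averaging (rather than summing) keeps the total area equal to 1.
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
# Illustrative data: 150 normal values plus 450 copies of a fixed value.
samples = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 450)])
grid = np.linspace(-8, 8, 2001)

narrow = gaussian_kde_manual(samples, grid, bandwidth=0.05)  # hugs the data
wide = gaussian_kde_manual(samples, grid, bandwidth=1.0)     # smooths everything out
```

With the narrow bandwidth the repeated value shows up as a tall spike; with the wide one it is blended into the rest of the curve.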

Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})

for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()

The third plot gives a warning that the kde can not be drawn using Scott's suggested bandwidth.

PS: As mentioned by @mwaskom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott": 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). The min() explains the abrupt change in behavior. IQR is the "interquartile range", the difference between the 75th and 25th percentiles. Once the repeated minimum value reaches past the 75th percentile, the IQR becomes 0, and so does the bandwidth.
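The abrupt change can be reproduced with a small NumPy sketch that applies the formula quoted above. This is a reimplementation for illustration, not the statsmodels code itself; the function name, the synthetic data and the choice of `ddof=1` and NumPy's default percentile interpolation are assumptions.

```python
import numpy as np

def scott_bandwidth_min_rule(x):
    """Bandwidth 1.059 * A * n**(-1/5) with A = min(std, IQR / 1.34)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 1.059 * spread * x.size ** (-1 / 5)

rng = np.random.default_rng(0)
varied = rng.normal(0, 1, 150)
# Hypothetical minimum value, strictly below every varied value.
floor = varied.min() - 1.0

bandwidths = {}
for n_min in (450, 451):
    y = np.concatenate([varied, np.full(n_min, floor)])
    bandwidths[n_min] = scott_bandwidth_min_rule(y)
    print(f"{n_min} of {y.size} at the minimum -> bandwidth {bandwidths[n_min]:.4f}")
```

With 450 of 600 values at the minimum, the 75th percentile is interpolated between the last minimum and the first larger value, so the IQR is small but positive. With 451 of 601, both quartiles sit on the minimum, the IQR is zero, and the bandwidth collapses to zero.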

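Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde. In the same release the bw parameter used above was deprecated in favor of bw_method and bw_adjust.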
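For comparison, here is a sketch of the kernel width that scipy.stats.gaussian_kde is understood to use with its default Scott rule in one dimension: the factor `n ** (-1/5)` multiplied by the sample standard deviation, with no IQR term. The helper name is hypothetical and the calculation is reimplemented in NumPy under that assumption rather than taken from scipy.

```python
import numpy as np

def scipy_style_scott_bandwidth(x):
    """Kernel standard deviation: std(x, ddof=1) * n**(-1/5) (1-D Scott rule)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) * x.size ** (-1 / 5)

rng = np.random.default_rng(0)
varied = rng.normal(0, 1, 150)

std_based = {}
for n_same in (450, 451):
    # Illustrative data: 150 normal values plus n_same copies of -3.
    y = np.concatenate([varied, np.full(n_same, -3.0)])
    std_based[n_same] = scipy_style_scott_bandwidth(y)
    print(f"{n_same} repeated values -> kernel std {std_based[n_same]:.4f}")
```

Because the standard deviation changes only slightly when one more repeated value is added, a bandwidth computed this way stays positive and almost unchanged between the two cases.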

Just to add some context here: I think the explanation for the most surprising aspect of this (the abrupt change with an additional datapoint) is that statsmodels uses `min(iqr, sd)` when calculating Scott's factor and those measures of spread can be quite different for data with a lot of repeated observations. — mwaskom, May 14 '20 at 14:14
Thanks, that's very helpful. I had already experimented with changing 'scott' to various bw values, and while it seems to work well in the univariate plots, when I move to bivariate plots every value of bw I try doesn't show smaller populations at all. Are there guidelines on choosing an appropriate bw value? — iayork, May 14 '20 at 14:51
In general the gaussian KDE is meant for smooth continuous data. You might try to filter out the outliers and plot their count as a bar (or a small pie plot) separately. — JohanC, May 14 '20 at 15:08
Another approach that seems to work is to apply a small amount of random jitter to the minimum values — iayork, May 14 '20 at 15:46

score 2 · Answer 2 · answered May 17 '20 at 14:04

If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. Kernel smoothing can be applied to such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated with the Dirac is zero. However, in most KDE methods, there is a single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on an estimate of the roughness of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.

We can, however, try to separate the sample into two sub-samples:

the sub-sample(s) with replications,
the sub-sample with unique realizations.

(This idea has already been mentioned by JohanC.)

Below is an attempt to perform this classification. The np.unique function is used to count the occurrences of the replicated realizations. The replicated values are associated with Diracs, and their weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, which are unique, are then used to estimate the continuous distribution with KDE.

The following function will be useful in order to overcome a limitation with the current implementation of the draw method of Mixtures with OpenTURNS.

def DrawMixtureWithDiracs(distribution):
    """Draw a distribution which has Diracs.
    https://github.com/openturns/openturns/issues/1489"""
    graph = distribution.drawPDF()
    graph.setLegends(["Mixture"])
    for atom in distribution.getDistributionCollection():
        if atom.getName() == "Dirac":
            curve = atom.drawPDF()
            curve.setLegends(["Dirac"])
            graph.add(curve)
    return graph

The following script creates a use case with a Mixture containing a Dirac and a Gaussian distribution.

import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0),
                          ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)

This is the result.

Then we create a sample.

sample = distribution.getSample(100)

This is where your problem begins. We count the number of occurrences of each realization.

array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True,
                                 return_counts=True)

For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.

sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
    if count[i] == 1:
        uniqueValues.append(unique[i][0])
    else:
        atom = ot.Dirac(unique[i])
        listOfDiracs.append(atom)
        w = count[i] / sampleSize
        print("New Dirac =", unique[i], " with weight =", w)
        listOfWeights.append(w)

The weight of the continuous atom is the complement of the sum of the weights of the Diracs. This way, the sum of the weights will be equal to 1.

complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)

Now comes the easy part: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.

sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)

Et voilà: the Mixture is ready.

mixture_estimated = ot.Mixture(atoms, weights)

The following script compares the initial Mixture and the estimated one.

graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph

The figure seems satisfactory, given that the continuous distribution is estimated from a sub-sample whose size is only about 50, i.e. one half of the full sample.
