
I am using sns.kdeplot(data) to obtain a kernel density estimate for my one-dimensional dataset.

As I understand it from seaborn's documentation on kdeplot, sns.kdeplot() passes bw_method="scott" to scipy.stats.gaussian_kde to obtain a rule-based bandwidth automatically for smoothing the KDE plot in question.

Can I access the bandwidth that seaborn used automatically for its kdeplot? My idea was to reproduce seaborn's steps with scipy.stats.gaussian_kde, applying Scott's rule as given in the documentation, len(data)**(-1./(1+4)), to obtain a value for bw. However, the value I obtain produces a KDE plot that is visually different from seaborn's. In other words, what is bw_rule such that sns.kdeplot(data, bw=bw_rule) == sns.kdeplot(data)?
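To make the quantities in the question concrete, here is a minimal sketch that uses only NumPy and SciPy (seaborn itself is not called). The sample `data` is made up for illustration. It suggests that len(data)**(-1./(1+4)) is the dimensionless factor scipy reports, while the kernel standard deviation in data units is that factor times the sample standard deviation (scipy's covariance uses ddof=1).

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up 1-D sample, for illustration only.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=500)

kde = gaussian_kde(data, bw_method="scott")

# Scott's rule for d = 1: a dimensionless factor, not a kernel width.
scott_factor = len(data) ** (-1.0 / (1 + 4))

# Kernel standard deviation in data units: factor times the sample sd.
# gaussian_kde computes its covariance with ddof=1.
kernel_sd = kde.factor * np.std(data, ddof=1)

print("Scott factor (by hand):", scott_factor)
print("Scott factor (scipy):  ", kde.factor)
print("Kernel sd (data units):", kernel_sd)
```

If this holds for your data as well, passing the bare factor where a width in data units is expected (or the other way round) would explain a visually different curve.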

CodeTrek
  • Well, I got it (in case anyone is reading this or curious) through kde_gaussian.covariance_factor()*np.std(data), but I am still wondering whether these values are stored somewhere in sns where they can be accessed directly instead of being recalculated. – CodeTrek Dec 26 '20 at 22:29
  • I tried KDEUnivariate with "scott", i.e. 1.059*A*nobs**(-1/5.), as per your comment, but with mixed results: the resulting plot often underfits seaborn's KDE. However, kde_gaussian.covariance_factor()*np.std(data) now seems identical to sns.kdeplot when I plot the rule over seaborn's curve. The identity seems to hold for every sample dataset that I've tried so far, but it's too early to generalize. Out of curiosity, where is it written that seaborn uses KDEUnivariate? Looking at this [doc](https://seaborn.pydata.org/generated/seaborn.kdeplot.html), it appears to refer to scipy.stats.gaussian_kde. – CodeTrek Dec 27 '20 at 14:18
  • Thanks, yes, it has probably changed since then. The GitHub source seems to import gaussian_kde [line 36]. – CodeTrek Dec 27 '20 at 15:45
  • seaborn 0.11+ exclusively uses scipy for KDE computation. Prior to that, either statsmodels or scipy was used, depending on whether the former was installed. There are some differences in how those two libraries interpret the bandwidth parameter (statsmodels uses it directly as the sd of the kernel; in scipy it is a multiplicative factor that scales the data sd to get the kernel size). This is [one of the reasons](https://github.com/mwaskom/seaborn/pull/2104) the backend was simplified. – mwaskom Dec 27 '20 at 20:38
  • The most relevant section of the code is [here](https://github.com/mwaskom/seaborn/blob/master/seaborn/_statistics.py#L134) – mwaskom Dec 27 '20 at 21:05
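The two bandwidth conventions described in the comments can be compared with a small sketch. It assumes made-up data and an arbitrary kernel width `h`, and it does not call seaborn or statsmodels: the "direct" estimate is a hand-written average of Gaussian kernels whose sd is `h`, while the scipy estimate receives `h` divided by the sample sd as its multiplicative factor.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Made-up sample and evaluation grid, for illustration only.
rng = np.random.default_rng(1)
data = rng.normal(size=200)
grid = np.linspace(data.min() - 1.0, data.max() + 1.0, 50)

# Arbitrary kernel sd in data units (bandwidth used "directly").
h = 0.3

# Direct convention: average of Gaussian kernels with sd h,
# one kernel centred on each observation.
direct = norm.pdf(grid[:, None], loc=data[None, :], scale=h).mean(axis=1)

# scipy convention: a scalar bw_method is a factor that multiplies
# the sample sd (ddof=1), so divide h by that sd to get the same kernel.
factor = h / np.std(data, ddof=1)
via_scipy = gaussian_kde(data, bw_method=factor)(grid)

print("max abs difference:", np.max(np.abs(direct - via_scipy)))
```

Under these assumptions the two curves should agree to floating-point precision, which is one way to check a candidate bandwidth before overlaying it on a seaborn plot.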
