lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.

The following code:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 ,  -75.71982528,  -12.1897835 ,  -73.94903264,
   -178.14293936, -123.51339541, -118.11826988,  -50.19812838,
    -43.69282206,  -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()

yields

[Figure: Predicted continuous experimental spectrum, based on theoretical lines, bandwidth=5 MHz]

This looks like a single Gaussian with a bandwidth much larger than 5 MHz.

I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 ,  -75.71982528,  -12.1897835 ,  -73.94903264,
   -178.14293936, -123.51339541, -118.11826988,  -50.19812838,
    -43.69282206,  -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()

I get a plot with lines that seem to have the expected 5 MHz bandwidth.

As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.

Thanks,

Samuel

Samuel Markson
  • The bandwidth parameters are chosen by heuristics, where you can choose between 2 different ones. There are cases where this fails. Normally one uses cross-validation to estimate this parameter, which is not possible with seaborn. Grid-search-based CV is possible with scikit-learn, optimization-based CV is possible with statsmodels. – sascha Jun 20 '16 at 22:43
  • Thanks Sascha. As I understand, you are referring to Scott's and Silverman's rules. The other option--again, as I understand--is setting the bandwidth explicitly, as I've done above. – Samuel Markson Jun 21 '16 at 02:10

1 Answer

One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either the SciPy or the Statsmodels package, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)

The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.

So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide it by the standard deviation of your data. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment is effectively what you did by scaling by the range -- but again, the number used is not the range of the data but its standard deviation.
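The scaling described above can be checked directly against SciPy, without going through Seaborn. This is only a minimal sketch: it assumes the scalar `bw_method` of `scipy.stats.gaussian_kde` is multiplied by the sample standard deviation computed with `ddof=1`, and the names `target_bw`, `factor`, and `effective_bw` are illustrative, not part of any library.

```python
import numpy as np
from scipy.stats import gaussian_kde

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])

target_bw = 5.0  # desired kernel bandwidth in MHz (illustrative name)

# Assumption: SciPy multiplies a scalar bw_method by the sample
# standard deviation (ddof=1), so divide the target by that value.
factor = target_bw / lineslist.std(ddof=1)
kde = gaussian_kde(lineslist, bw_method=factor)

# For 1-D data, kde.covariance is a 1x1 array holding the squared
# kernel bandwidth, so its square root is the bandwidth actually used.
effective_bw = float(np.sqrt(kde.covariance[0, 0]))

# Passing 5 directly is treated as a factor, giving a far wider kernel.
naive_kde = gaussian_kde(lineslist, bw_method=5)
naive_bw = float(np.sqrt(naive_kde.covariance[0, 0]))

print(effective_bw, naive_bw)
```

If that assumption holds, `effective_bw` comes out at the requested 5 MHz, while `naive_bw` is several hundred MHz, which would explain the single broad bump in the first plot.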

To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
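A minimal sketch of that import test follows. It assumes, as described above for Seaborn 0.7.0, that the backend is chosen purely by whether statsmodels can be imported. The helper name `seaborn_bw_argument` is hypothetical and not part of Seaborn.

```python
import importlib.util

import numpy as np


def seaborn_bw_argument(data, bandwidth):
    """Hypothetical helper: value to pass as `bw` for a fixed bandwidth.

    Assumes the Seaborn 0.7.0 behaviour described above: statsmodels
    takes the bandwidth as-is, SciPy takes a factor that is multiplied
    by the sample standard deviation (ddof=1).
    """
    if importlib.util.find_spec("statsmodels") is not None:
        # statsmodels is importable, so the scalar is the true bandwidth
        return float(bandwidth)
    # SciPy fallback: convert the bandwidth into a factor
    return float(bandwidth) / float(np.std(data, ddof=1))


lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])
bw_value = seaborn_bw_argument(lineslist, 5.0)
print(bw_value)
```

The result could then be passed as `bw=bw_value` to `sns.kdeplot`. Other Seaborn versions may select the backend differently, so this should be checked against the version in use.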

By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.

R.M.
  • Great find R.M. Indeed, I hadn't installed statsmodels, and installing it fixes this problem. – Samuel Markson Jul 19 '16 at 04:16
  • Hi Samuel, I wish to produce 2D KDE plots using Seaborn, but with a KDE bw which doesn't alter with standard deviation (I'm tracking the location of an object, but wish only to convey to the reader a sense of uncertainty in location precision). I therefore think I need to implement a KDE with an absolute bw value unchanging between plots. I think you have worked out how to utilise the statsmodels bw reference - but I'm not sure how you did it! Would you mind describing what you did? It wasn't simply an import of statsmodels, I'm sure?! – thescoop Aug 20 '18 at 13:04
  • I wish to set an unchangeable bandwidth. – thescoop Aug 20 '18 at 13:12
  • I wish to set a non-changing bandwidth (no matter what my dataset). You say "It prefers Statsmodels, but will fall back to SciPy", but how do I force it to use statsmodels, and how do I know it's then using statsmodels for sure? – thescoop Aug 20 '18 at 13:13
  • @thescoop IIRC, it was entirely based on whether or not statsmodels was installed. So simply install statsmodels into the Python you're using (e.g. with pip). -- Note that I haven't looked at newer Seaborn versions - I don't know if their behavior is any different from 0.7.0. – R.M. Aug 20 '18 at 15:02