
My problem is summed up by the plots at the bottom of this post.

They show the buggy PairGrid at progressively higher zoom; the key plots are in the left column. The points in my PairGrid are very scattered, but as the third plot shows, the bulk of them are still fairly localised in what I expected to be a Gaussian distribution.

Unfortunately, the KDE contour plot seems to completely miss the main bulk of the points and centres itself on a few outliers instead.

Here's the code I'm using to generate the plots from a pandas DataFrame:

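import seaborn as sns
from matplotlib import pyplot as plt

# HP is the pandas DataFrame holding the data
g = sns.PairGrid(HP, diag_sharey=False)
# Lower triangle: KDE contours with the raw points drawn on top
g.map_lower(sns.kdeplot, n_levels=5)
g.map_lower(plt.scatter, marker='^', alpha=0.7, color='y')
# Upper triangle: raw points only
g.map_upper(plt.scatter, marker='+')
# Diagonal: univariate KDE of each column
g.map_diag(sns.kdeplot)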

I'm trying to figure out why this is happening. Does kdeplot use only a subsample of the points, or is something else going on?

[Three images: the PairGrid at progressively higher zoom]

Marses
  • Can't say for sure without access to your data but your distribution is extremely skewed and a gaussian KDE assumes that a gaussian is a reasonably good fit to the distribution. – mwaskom Nov 04 '16 at 14:19
  • Well my data is indeed pretty trashy, I'm trying to fix that separately :D. However, as bad as it may be, the KDE isn't doing what I expected. From looking at the above, it doesn't seem like the main bulk of the data is contributing any Gaussian kernels at all. But I guess it could also be a problem somewhere with the contour plotter if the pdf has too many features and small peaks. – Marses Nov 04 '16 at 15:32
  • It's not a matter of good or bad data, it's a matter of data that matches statistical assumptions. e.g. the x variable in the middle plot has extremely high kurtosis. Fitting that distribution with a gaussian kernel that is only matched based on the variance will look like the contours are "missing data". I'd encourage you to focus on a single variable or bivariate relationship and play around with the bandwidth of the KDE (or see what fitting a gaussian to the distribution itself looks like) to get a better intuition for what is happening. – mwaskom Nov 04 '16 at 16:15
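
The bandwidth effect described in the last comment can be reproduced on a single variable. The following is only a minimal sketch under stated assumptions: the data are synthetic (a tight Gaussian bulk plus a few distant outliers, standing in for one column of HP), the helper `gaussian_kde_1d` is hypothetical and hand-written rather than seaborn's own estimator, and the default bandwidth is assumed to follow a Scott-style rule (sample standard deviation times n^(-1/5)).

```python
import math
import random
import statistics


def gaussian_kde_1d(data, bandwidth):
    """Return a function evaluating a 1-D Gaussian KDE with the given kernel std.

    Hypothetical helper for illustration; not seaborn's implementation.
    """
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


random.seed(0)
# Synthetic stand-in for one high-kurtosis column: tight bulk + few outliers.
bulk = [random.gauss(0.0, 1.0) for _ in range(1000)]
outliers = [random.gauss(0.0, 200.0) for _ in range(20)]
x = bulk + outliers

# Scott-style bandwidth: driven by the sample std, which the outliers inflate.
sample_std = statistics.stdev(x)
scott_bw = sample_std * len(x) ** (-1.0 / 5.0)
# A much narrower bandwidth, chosen by hand purely for comparison.
narrow_bw = scott_bw / 20.0

kde_scott = gaussian_kde_1d(x, scott_bw)
kde_narrow = gaussian_kde_1d(x, narrow_bw)

# Evaluate both estimates on a grid covering the bulk only.
grid = [i / 20.0 for i in range(-100, 101)]
peak_scott = max(kde_scott(g) for g in grid)
peak_narrow = max(kde_narrow(g) for g in grid)

print("sample std (inflated by outliers):", sample_std)
print("Scott-style kernel std:", scott_bw)
print("narrow kernel std:", narrow_bw)
print("peak density over the bulk, Scott-style:", peak_scott)
print("peak density over the bulk, narrow:", peak_narrow)
```

In this sketch every point contributes a kernel, so nothing is subsampled. The variance-based kernels are simply much wider than the bulk, which flattens the peak there and can make contours look as if they follow the outliers.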

0 Answers