I'm trying to figure out the difference between KDE plots in seaborn and distplot in plotly. While I understand that both try to estimate the underlying distribution of the data, I'm not sure how exactly.

For instance, I tried plotting the kde plot and dist plot of two variables from the same dataset.

This is the kde plot using seaborn:

and this is the dist plot using plotly:

What is the difference between these two graphs, and how should we interpret them? Also, can KDE plots be used with an imbalanced dataset (i.e. there are two categories of the dependent variable and the number of data points in each category differs greatly)?

Derek O
swastika

1 Answer

The two plots look different because you are passing the entire dataset to sns.kdeplot. Seaborn recognizes that there are two categories and, by default, normalizes them jointly, so the areas under both curves combined sum to 1.

When you pass the two categories as separate arrays to ff.create_distplot in plotly, each curve is normalized on its own, so the area under each individual curve is 1.

If you were to use the same code to plot an imbalanced dataset, for example one with a 90-10 class imbalance, the minority category in the seaborn kde plot would look like a nearly flat line with density close to 0. The distplot in plotly would treat the two categories separately from one another, so you might not realize that you were viewing an imbalanced dataset.
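To make the normalization difference concrete, here is a minimal sketch in plain Python that does not call seaborn or plotly. It builds a hand-rolled Gaussian KDE for a synthetic 90-10 dataset and compares the two scalings. The data, the helper names (`gaussian_kde`, `area`) and the rule-of-thumb bandwidth are all assumptions made for illustration, and the bandwidth is not necessarily the one either library uses.

```python
import math
import random
import statistics


def gaussian_kde(sample, grid):
    """Evaluate a Gaussian KDE of `sample` at every point in `grid`.

    The bandwidth is a normal-reference rule of thumb, chosen here only
    for illustration (an assumption, not either library's default).
    """
    n = len(sample)
    h = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = n * h * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm
        for x in grid
    ]


def area(density, grid):
    """Trapezoidal approximation of the area under `density` over `grid`."""
    return sum(
        (grid[i + 1] - grid[i]) * (density[i] + density[i + 1]) / 2
        for i in range(len(grid) - 1)
    )


# Synthetic, hypothetical data with a 90-10 class imbalance.
rng = random.Random(0)
majority = [rng.gauss(0.0, 1.0) for _ in range(900)]
minority = [rng.gauss(3.0, 1.0) for _ in range(100)]
total = len(majority) + len(minority)

grid = [-8.0 + 0.1 * i for i in range(201)]  # evenly spaced from -8 to 12

# Separate normalization: each category's curve integrates to 1 on its own.
sep_major = gaussian_kde(majority, grid)
sep_minor = gaussian_kde(minority, grid)

# Joint normalization: each curve is scaled by its share of the observations,
# so only the two areas together add up to 1.
joint_major = [d * len(majority) / total for d in sep_major]
joint_minor = [d * len(minority) / total for d in sep_minor]

areas = {
    "separate_majority": area(sep_major, grid),
    "separate_minority": area(sep_minor, grid),
    "joint_majority": area(joint_major, grid),
    "joint_minority": area(joint_minor, grid),
}

for name, value in areas.items():
    print(f"{name}: area = {value:.3f}")
print(f"peak of minority curve, separate: {max(sep_minor):.3f}")
print(f"peak of minority curve, joint:    {max(joint_minor):.3f}")
```

With joint scaling the minority curve is ten times lower than its separately normalized version, which is why it can look almost flat next to the majority curve. In seaborn, passing common_norm=False to sns.kdeplot should switch to the per-category normalization.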

Derek O
  • So if we're trying to compare the distributions of the 2 categories, we should compare their individual curves (such that the area under each equals 1), right? – swastika Jun 07 '22 at 12:03
  • @swastika if there's a large imbalance between the two categories, yes – you'll want to compare their individual curves so that the distributions are clear – Derek O Jun 07 '22 at 15:18