I have overlaid violin plots comparing group A and group B scores for a particular section of a survey, facet-wrapped by section. The scores are discrete 1-7 values. In some of these violin plots, the smoothing works as expected. In others, one group or the other looks very "wavy" between discrete scores (shown below).

I thought the problem may be a difference in the group sizes, but then surely the "waviness" would appear in all the section plots.

Also, this doesn't explain to me why the plots "dip in" despite being discrete 1-7 values.

When I add the adjust parameter it over-smooths the already smooth sections, so it's not quite ideal.

I use this code to create the plots:

create_violin_across_groups_by_section <- function(data, test_group="first") {
  g <- ggplot(data) + 
    aes(x=factor(nrow(data)),y=score,fill=group) +
    geom_violin(alpha=0.5,position="identity") +
    facet_wrap("section") +
    labs(
      title = paste("Comparison across groups for ", test_group)
    ) 
  return(g)
}

which results in something like this:

[Image: violin plot with one wavy section]

In this case, "openness" is oddly wavy, while the others all appear to be smoothed as normal.

I've thought perhaps it has something to do with the x=factor(nrow(data)) but again, surely the waviness would appear in all the section plots.

I would expect either all of the plots to be wavy (though I still wouldn't understand why) or all of them to have the same smoothness.

How can I make all of the facet-wrapped plots have the same smoothness, and why are they different in the first place?

Thanks all

Carolyn
  • Since violin plots show kernel density estimates, the smoothness is determined by the smoothing bandwidth (or smoothing parameter). The "wavy" effect is most likely due to a smoothing bandwidth that is too small for the data. See for example the Wikipedia entry on [bandwidth selection](https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection). – Maurits Evers Aug 07 '19 at 10:28
  • Upon re-reading your post I only now notice that you're showing kernel estimates based on discrete data. I'm not sure this makes sense statistically. There exist various methods to provide kernel density estimators for discrete data but I'm not sure whether R's `density` includes those methods. See for example [A kernel estimator for discrete distributions](https://www.tandfonline.com/doi/abs/10.1080/10485259508832629). – Maurits Evers Aug 07 '19 at 10:43

1 Answer

The shape of the violin plot is calculated with a kernel density estimate. Kernel density estimations are designed for continuous data and not for discrete data, like your scores. While you can feed discrete data to the kernel estimator, the result may not always be beautiful or even meaningful. You can try different kernel and bw argument values in geom_violin, or you might consider something designed for discrete data, such as geom_dotplot.

+ geom_dotplot(binaxis = "y", stackdir = "center", position = "dodge")

Check out the corresponding example of geom_dotplot at https://ggplot2.tidyverse.org/reference/geom_dotplot.html for a preview of what it can look like.

Check out the kernel and bw descriptions in the geom_violin documentation at https://ggplot2.tidyverse.org/reference/geom_violin.html, which point to the density function at https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/density for further information on how kernel density estimates are calculated.
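
To make the cause of the uneven smoothing concrete, here is a minimal base R sketch. It assumes geom_violin is left at its default bw = "nrd0", which chooses a separate bandwidth for each group in each facet from that subset's spread and size. The two score vectors are made up for illustration and are not the asker's data.

```r
# Made-up 1-7 scores for two hypothetical sections, 200 responses each.
concentrated <- rep(1:7, times = c(2, 4, 10, 24, 80, 60, 20))   # mostly 5s and 6s
spread_out   <- rep(1:7, times = c(28, 29, 29, 28, 29, 29, 28)) # roughly uniform

# bw.nrd0() is the rule density() applies by default: the bandwidth
# shrinks when the scores are tightly clustered.
bw_concentrated <- bw.nrd0(concentrated)
bw_spread       <- bw.nrd0(spread_out)
print(c(concentrated = bw_concentrated, spread_out = bw_spread))

# Supplying one numeric bandwidth overrides the per-subset rule, so both
# estimates are smoothed by the same amount.
shared_bw <- 0.6  # illustrative value (an assumption), to be tuned by eye
d_concentrated <- density(concentrated, bw = shared_bw)
d_spread       <- density(spread_out,   bw = shared_bw)
```

The tightly clustered section gets a bandwidth that is small relative to the gap of 1 between scores, so each score appears as its own bump, while the spread-out section gets a larger bandwidth that blurs neighbouring scores together. In the plotting function, passing a numeric value such as geom_violin(bw = 0.6, alpha = 0.5, position = "identity") should apply the same bandwidth in every facet; 0.6 is only a starting point.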

zeehio
  • @Carolyn *"Kernel density estimations are designed for continuous data and not for discrete data, like your scores."* (+1) Yes, this is the critical bit;-) There exist modified kernel density estimators for discrete data but that's not what `density` does. So in short, don't use violin plots/density plots for discrete data. – Maurits Evers Aug 07 '19 at 10:48
  • Great, thanks all! So as far as I can tell, geom_dotplot is essentially the same thing but for discrete data? – Carolyn Aug 07 '19 at 10:52