transform histogram to violin plot in r with ggplot

Question

I am currently trying to learn r with the help of Hadley Wickham's great resources ("r for data scientists", "ggplot2 Elegant Graphics for Data Analysis"). So far I was able to find answers to all my problems there (thank you so much, Hadley!), but not this time.

Currently, I am working with data from an instrument that estimates particle size by the light the particles scatter (DLS, Zetasizer Nano, Malvern Instruments). The data extracted from this device are some summary statistics (e.g. mean particle size) and histogram data: x = size (split in bins), y = intensity [%]. Here is a tibble of one of my measurements:

   # A tibble: 70 x 3
   sample_name        intensities      bins
   <chr>                    <dbl>     <dbl>
 1 core formulation 1         0       0.4  
 2 core formulation 1         0       0.463
 3 core formulation 1         0       0.536
 4 core formulation 1         0       0.621
 5 core formulation 1         0       0.720
 6 core formulation 1         0       0.833
 7 core formulation 1         0       0.965
 8 core formulation 1         0       1.12 
 9 core formulation 1         0       1.29 
10 core formulation 1         0       1.50 
11 core formulation 1         0       1.74 
12 core formulation 1         0       2.01 
13 core formulation 1         0       2.33 
14 core formulation 1         0       2.70 
15 core formulation 1         0       3.12 
16 core formulation 1         0       3.62 
17 core formulation 1         0       4.19 
18 core formulation 1         0       4.85 
19 core formulation 1         0       5.62 
20 core formulation 1         0       6.50 
21 core formulation 1         0       7.53 
22 core formulation 1         0       8.72 
23 core formulation 1         0      10.1  
24 core formulation 1         0      11.7  
25 core formulation 1         0      13.5  
26 core formulation 1         0      15.7  
27 core formulation 1         0      18.2  
28 core formulation 1         0      21.0  
29 core formulation 1         0      24.4  
30 core formulation 1         0      28.2  
31 core formulation 1         0      32.7  
32 core formulation 1         0      37.8  
33 core formulation 1         0      43.8  
34 core formulation 1         0.2    50.8  
35 core formulation 1         1.4    58.8  
36 core formulation 1         3.7    68.1  
37 core formulation 1         6.9    78.8  
38 core formulation 1        10.2    91.3  
39 core formulation 1        12.9   106.   
40 core formulation 1        14.4   122.   
41 core formulation 1        14.4   142.   
42 core formulation 1        13     164.   
43 core formulation 1        10.3   190.   
44 core formulation 1         7.1   220.   
45 core formulation 1         3.9   255    
46 core formulation 1         1.5   295.   
47 core formulation 1         0.2   342    
48 core formulation 1         0     396.   
49 core formulation 1         0     459.   
50 core formulation 1         0     531.   
51 core formulation 1         0     615.   
52 core formulation 1         0     712.   
53 core formulation 1         0     825    
54 core formulation 1         0     955.   
55 core formulation 1         0    1106    
56 core formulation 1         0    1281    
57 core formulation 1         0    1484    
58 core formulation 1         0    1718    
59 core formulation 1         0    1990    
60 core formulation 1         0    2305    
61 core formulation 1         0    2669    
62 core formulation 1         0    3091    
63 core formulation 1         0    3580    
64 core formulation 1         0    4145    
65 core formulation 1         0    4801    
66 core formulation 1         0    5560    
67 core formulation 1         0    6439    
68 core formulation 1         0    7456    
69 core formulation 1         0    8635    
70 core formulation 1         0   10000

Here is the data produced with the dput() command:

structure(list(sample_name = c("core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1"), intensities = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 1.4, 3.7, 6.9, 10.2, 12.9, 
14.4, 14.4, 13, 10.3, 7.1, 3.9, 1.5, 0.2, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), bins = c(0.4, 
0.4632, 0.5365, 0.6213, 0.7195, 0.8332, 0.9649, 1.117, 1.294, 
1.499, 1.736, 2.01, 2.328, 2.696, 3.122, 3.615, 4.187, 4.849, 
5.615, 6.503, 7.531, 8.721, 10.1, 11.7, 13.54, 15.69, 18.17, 
21.04, 24.36, 28.21, 32.67, 37.84, 43.82, 50.75, 58.77, 68.06, 
78.82, 91.28, 105.7, 122.4, 141.8, 164.2, 190.1, 220.2, 255, 
295.3, 342, 396.1, 458.7, 531.2, 615.1, 712.4, 825, 955.4, 1106, 
1281, 1484, 1718, 1990, 2305, 2669, 3091, 3580, 4145, 4801, 5560, 
6439, 7456, 8635, 10000)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -70L))

I can produce a histogram with no problems from this data:

library(tidyverse)
ggplot (DLS_intensities_core, aes(bins,intensities) ) + 
  geom_line() + 
  scale_x_continuous(trans = 'log10')

line plot, intensities against bins

In order to show the overall distribution of my particle size, I would like to convert this data into a violin plot and use the summary statistics provided by the device in a second layer of my plot.

Therefore, I would like to transform this data to be able to create a violin plot from it.

I have already tried feeding it to the stat_density () argument of the violin plot but so far with no success.

Do you know how to create a violin plot from this data?

Thank you very much!

Best,

Dominik

Welcome to SO! Could you jump over to https://stackoverflow.com/tags/r/info and take a look at the suggestions for how to post data (some are in links off of that URL)? I think the SO R contributors could make quick work of this if the data was in a tad easier format to work with. — hrbrmstr, Nov 25 '18 at 16:33
I am sorry, I will add it immediately! thank you for pointing it out! — Dome, Nov 25 '18 at 16:39
Wow! Best data dump of the day here! (and, no apologies are required. the SO question entry screen leaves much to be desired). Are you just looking for a violin plot of `DLS_intensities_core$bins` and `DLS_intensities_core$intensities` or are there some other transformations you needed performed on them? — hrbrmstr, Nov 25 '18 at 17:30
Thanks :-) I think I am just looking for a violin plot of `DLS_intensities_core$bins` and `DLS_intensities_core$intensities`. In the end, I want to produce a plot with the following specs: x-axis: particle size [nm]; y-axis: sample name (e.g. "core formulation"); 1st. layer: violin plot (from the data I presented here); 2nd. layer: summary statistics (calculated by the Zetasizer, not shown here for clarity) That means I will need to do some additional transformations for sure, but I want to try it myself first ;-) — Dome, Nov 25 '18 at 17:43

hrbrmstr · Answer 1 · 2018-11-25T18:22:57.083

I'll update this (if needed) after your reply to the second comment. You can get violin plots of both bins and intensities with:

library(hrbrthemes)

gather(DLS_intensities_core, measure, value, -sample_name) %>% 
  ggplot(aes(measure, value)) +
  geom_violin(scale = "count") +
  scale_y_comma() +
  facet_wrap(~measure, scales="free") +
  labs(
    x = NULL, y = "A better label than this",
    title = "A better title than this",
    caption = "NOTE: Free Y scales"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank())

I generally like to layer in the points as well:

gather(DLS_intensities_core, measure, value, -sample_name) %>% 
  ggplot(aes(measure, value)) +
  geom_violin(scale = "count") +
  ggbeeswarm::geom_quasirandom() +
  scale_y_comma() +
  facet_wrap(~measure, scales="free") +
  labs(
    x = NULL, y = "A better label than this",
    title = "A better title than this",
    caption = "NOTE: Free Y scales"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank())

Per your comment, perhaps this might be a better way to show the bins distribution along with the relationship with intensities:

library(hrbrthemes)
library(tidyverse)

ggplot(DLS_intensities_core, aes(x="", bins)) +
  geom_violin(scale = "count") +
  ggbeeswarm::geom_quasirandom(
    aes(size = intensities, fill = intensities), shape = 21
  ) +
  scale_y_comma(trans="log10") +
  viridis::scale_fill_viridis(direction = -1, trans = "log1p") +
  scale_size_continuous(trans = "log1p", range = c(2, 10)) +
  guides(fill = guide_legend()) +
  labs(
    x = NULL, y = "A better label than this",
    title = "A better title than this"
  ) +
  theme_ipsum_rc(grid="Y")

You'd have to do some other, custom transformation to try to get the violin shape to vary with the intensities (and it wouldn't really be reflecting the distribution at that point).

thank you very much for your plots and the associated code, I like the theme of your plots very much! And the addition of dots to violin plots is a good idea :-) The first plot (bins) is really close to what I want to have: Size (or bins) on the y -axis (in log10 scale) and the expansion in x-direction of the plot is the value of the intensity (not the number of cases). Does this help or is it just the wrong plot for this idea? — Dome, Nov 25 '18 at 18:11
yes, this goes in the right direction! However, I would like the intensities to be depicted by the extension of the violin plot in x direction (in a way that the biggest extension of the violin plot is at around 100 on the y-axis, at the position that you printed the big dots at.). — Dome, Nov 26 '18 at 09:18
I had another thought while in the shower: Maybe my tibble is not in the right format for this type of plot. Since the violin plot's stat counts rows and depicts a summary, the most straight forward way would be to multiplay the bin size by intensity and create the corresponding amount of rows with this precise (bin-) size. E.g. if the size is 100 and the corresponding intensity 20, then I would create 20 new rows with bin = 100. I will try this approach today and let you know what I found :-) Thanks again for your help! :-) — Dome, Nov 26 '18 at 09:18

Dome · Accepted Answer · 2018-11-27T14:16:29.193

I found a solution to my problem, it is probably not very elegant:

library (tidyverse)

DLS_intensities_core <- DLS_intensities_core %>% 
  mutate(counts = intensities * 10 )

vectors <- DLS_intensities_core %>%
  filter(counts > 0) 

bins_v <- vectors$bins
count_v <- vectors$counts

violin_DLSdata <- as.tibble(rep.int(bins_v, count_v))
violin_DLSdata$sample_name <- "core formulation 1"

ggplot (violin_DLSdata, aes(sample_name, value)) + 
  geom_violin() + 
  labs(
    x = NULL, y = "size"
  ) +
  scale_y_continuous(trans = 'log10', limits = c(1, 1000))

for my whole dataset it looks like this: I have added: summary statistic as red dots with errorbars.

What do you think?

transform histogram to violin plot in r with ggplot

2 Answers2