
I have a data frame with timepoints (a, b and c), labels (l1, l2, l3) and frequencies that are distributed over the timepoints and labels. I want to create a Sankey diagram with the ggalluvial package in R. Here's some code:

library(tidyverse)
library(forcats)
library(ggalluvial)
library(magrittr)

plotAlluvial <- function(.df, name = freq) {
  # Capture the unquoted column name to use as the y aesthetic
  y_name <- enquo(name)
  ggplot(.df,
         aes(
           x = tp,
           stratum = lbl,
           alluvium = id,
           label = lbl,
           fill = lbl,
           y = !!y_name
         )
  ) +
    geom_stratum() +
    geom_flow(stat = "flow", color = "darkgray") +
    geom_text(stat = "stratum") +
    scale_fill_brewer(type = "qual", palette = "Set2")
}

x1 = c(6, 0, 0, 5, 5, 4, 2, 0, 3)
x2 = c(5, 5, 3, 0, 0, 5, 0, 7, 0)
# tibble() replaces the deprecated data_frame()
df = tibble(tp1 = rep(c('a', 'b'), each = 9),
            lbl1 = c(rep(c('l1', 'l2', 'l3'), 2, each = 3)),
            tp2 = rep(c('b', 'c'), each = 9),
            lbl2 = c(rep(c('l1', 'l2', 'l3'), 6)),
            freq = c(x1, x2)
)

df2=df %>% 
  mutate(id=row_number()) %>% 
  unite(un1,c(tp1,lbl1)) %>%
  unite(un2,c(tp2,lbl2)) %>%
  tidyr::gather(key,value,-c(freq,id)) %>%
  separate('value',c('tp','lbl')) 
df2.left= df2 %>% 
  dplyr::filter(!(key=='un1' & tp=='b'))
df2.right= df2 %>% 
  dplyr::filter(!(key=='un2' & tp=='b'))

I can plot the left side and plot the right side of the diagram I want:

plotAlluvial(df2.left)
plotAlluvial(df2.right)


But if I try to plot the left and right side at the same time I get this plot:

plotAlluvial(df2)


When I use the code above, the stratum at timepoint b is too tall: its frequencies add up to 50 instead of 25. It should be as high as the strata at the other two timepoints, so it should have a height of 25. What am I doing wrong? How can I create a diagram that combines the first two plots?
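The extra height at b can be checked directly from the data. The sketch below rebuilds the long-form table in base R (the names `wide`, `long`, `totals`, `b_by_key` and `spans` are illustrative and not part of the code above) and sums `freq` per timepoint. It only shows where the height of 50 comes from; it is not a fix.

```r
# Rebuild the question's data in base R (no packages needed).
x1 <- c(6, 0, 0, 5, 5, 4, 2, 0, 3)
x2 <- c(5, 5, 3, 0, 0, 5, 0, 7, 0)
wide <- data.frame(
  id   = 1:18,
  tp1  = rep(c("a", "b"), each = 9),
  lbl1 = rep(c("l1", "l2", "l3"), times = 2, each = 3),
  tp2  = rep(c("b", "c"), each = 9),
  lbl2 = rep(c("l1", "l2", "l3"), times = 6),
  freq = c(x1, x2),
  stringsAsFactors = FALSE
)

# Long form: one row per (id, side), mirroring the gather() step.
long <- rbind(
  data.frame(id = wide$id, key = "un1", tp = wide$tp1,
             lbl = wide$lbl1, freq = wide$freq),
  data.frame(id = wide$id, key = "un2", tp = wide$tp2,
             lbl = wide$lbl2, freq = wide$freq)
)

# Total frequency stacked at each timepoint.
totals <- tapply(long$freq, long$tp, sum)
print(totals)

# At "b", split the total by the side of the flow the rows came from.
b_rows <- long[long$tp == "b", ]
b_by_key <- tapply(b_rows$freq, b_rows$key, sum)
print(b_by_key)

# Number of distinct timepoints each alluvium id covers.
spans <- tapply(long$tp, long$id, function(x) length(unique(x)))
print(range(spans))
```

Timepoint b receives rows from both halves of the data (the a-to-b ids and the b-to-c ids), which is why it stacks twice as high as a and c. Each `id` also covers only two timepoints, so no alluvium passes through b from a to c.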

EDIT:

Following a comment, I added a variable that expresses each frequency as a proportion of its timepoint total. Now the stratum at b has the correct height, but the incoming and outgoing flows still only occupy 50% of each label at timepoint b.

df2 %<>% group_by(tp) %>% mutate(prop = freq / sum(freq)) %>%
ungroup() 
plotAlluvial(df2,prop)


Robert
  • Could you please provide the desired diagram? Unfortunately, I cannot understand what is wrong with your output. – atsyplenkov Nov 21 '18 at 16:38
  • Sorry, I can't. But I want the timepoint b stratum to be as high as the other two strata, so its freq should also be 25. – Robert Nov 21 '18 at 19:38
  • Do you basically want to be able to use `position = "fill"` (as documented [here](https://ggplot2.tidyverse.org/reference/position_stack.html)) with the **ggalluvial** geoms, and plot percentages rather than values? This isn't possible at present; you'd need to transform the data before calling `ggplot()` to achieve this. – Cory Brunson Nov 22 '18 at 22:02
  • Thanks Cory! It looks indeed like the thing I want. Do you have any idea how I should transform the data above so I can make the diagram I want? I have changed my question and added more pictures to better explain what I want. – Robert Nov 23 '18 at 08:32
  • @Robert, on closer inspection, it seems like you might be mis-encoding your data. Are you trying to track the distribution of 25 subjects over three labels along three time points? In that case, here's an artificial data frame of the correct (long form) structure: `tidyr::crossing(sub = 1:25, tp = letters[1:3]) %>% mutate(lbl = sample(paste0("l", 1:3), 25 * 3, replace = TRUE))`. – Cory Brunson Nov 23 '18 at 18:51
  • Sorry, i sent my last comment prematurely. If your data is correct and you want to use proportions, try this transformation before plotting: `df2 %>% group_by(tp) %>% mutate(prop = freq / sum(freq)) %>% ungroup()`. – Cory Brunson Nov 23 '18 at 19:14
  • I tried your suggestion and changed the code and edited my question. I don't think I can make this work with ggalluvial when the incoming and outgoing flows are different like at timepoint b. – Robert Nov 26 '18 at 09:47
  • @Robert you're right, sorry—i might have tried out that code with an edited data set. Are your data repeated observations of the same subjects at different time points, as in the artificial data set in my comment above? If not, then probably **ggalluvial** is not appropriate for them. – Cory Brunson Nov 29 '18 at 17:50
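
The long-form structure suggested in the comments can be sketched as follows. This is a minimal illustration with made-up labels, not the asker's data: `subjects`, `n`, `heights`, and the seed are hypothetical, and `expand.grid()` from base R stands in for `tidyr::crossing()`. The plot is only drawn if ggplot2 and ggalluvial are installed.

```r
set.seed(1)  # arbitrary seed so the made-up labels are reproducible

# Hypothetical data: 25 subjects, each observed at all three time points.
subjects <- expand.grid(sub = 1:25, tp = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
subjects$lbl <- sample(paste0("l", 1:3), nrow(subjects), replace = TRUE)
subjects$n <- 1  # each row counts as one subject

# Every time point stacks to the same height (25 subjects).
heights <- tapply(subjects$n, subjects$tp, sum)
print(heights)

# Draw the alluvial plot only when the plotting packages are available.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(subjects,
              aes(x = tp, stratum = lbl, alluvium = sub,
                  y = n, fill = lbl)) +
    geom_flow(color = "darkgray") +
    geom_stratum()
  print(p)
}
```

Because every subject has exactly one row per time point, each alluvium runs from a through b to c and the three strata have equal heights. Data that only record a-to-b and b-to-c transitions without a shared subject identifier do not have this structure.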

0 Answers