Alluvial plot with 2 different sources but a converging/shared variable [R]

Question

I have experience making alluvial plots with the ggalluvial package. However, I have run into an issue where I am trying to create an alluvial plot with two different sources that converge onto one variable.

Here is example data:

library(dplyr)
library(ggplot2)
library(ggalluvial)

data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

Here is the code I use to make the plot:

#prep the data
data <- data %>%
  group_by(shared_label) %>%
  mutate(freq = n())

data <- reshape2::melt(data, id.vars = c("unique_alluvium_entires", "freq"))
data$variable <- factor(data$variable, levels = c("label_1", "shared_label", "label_2"))

#ggplot
ggplot(data,
       aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
           y = freq, fill = value, label = value)) +
  scale_x_discrete(expand = c(.1, .1)) + 
  geom_flow() +
  geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
  geom_text(stat = "stratum", size = 4) +
  theme_void() +
  theme(
    axis.text.x = element_text(size = 12, face = "bold")
  )

Resulting plot (apparently I cannot embed images yet):

As you can see, I can remove the NA values, but the shared_label does not properly "stack". Each unique row should stack on top of each other in the shared_label column. This would also fix the sizing issue so that they are equal size along the y axis.

Any ideas on how to fix this? I have tried ggsankey, but the same issue arises and I cannot remove the NA values. Any tips are greatly appreciated!

score 0 · Accepted Answer · answered Dec 10 '21 at 17:58

This plot is the expected result of the "flow" statistical transformation, which is the default for the "flow" graphical object. (That is, geom_flow() is equivalent to geom_flow(stat = "flow").) It looks like what you want is to specify the "alluvium" statistical transformation instead. Below I've used all of your code, but only copied and edited the ggplot() call.

#ggplot
ggplot(data,
       aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
           y = freq, fill = value, label = value)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow(stat = "alluvium") +  # <-- specify alternate stat
  geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
  geom_text(stat = "stratum", size = 4) +
  theme_void() +
  theme(
    axis.text.x = element_text(size = 12, face = "bold")
  )
#> Warning: Removed 2 rows containing missing values (geom_text).

Created on 2021-12-10 by the reprex package (v2.0.1)
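
To compare the two statistical transformations side by side, here is a minimal, self-contained sketch built on the example data from the question. Two things in it are assumptions rather than part of the original code: tidyr::pivot_longer() is swapped in for reshape2::melt(), and the names lodes, make_plot, p_flow and p_alluvium are hypothetical helpers introduced only for illustration. The sketch only builds the two plot objects; print each one to see the difference.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggalluvial)

# Example data from the question
data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

# Same preparation as in the question, but with tidyr instead of reshape2
# (assumption: pivot_longer() stands in for melt(); NA rows are kept)
lodes <- data %>%
  group_by(shared_label) %>%
  mutate(freq = n()) %>%
  ungroup() %>%
  pivot_longer(
    cols = c(label_1, shared_label, label_2),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable = factor(variable, levels = c("label_1", "shared_label", "label_2"))
  )

# Hypothetical helper: identical plot, only the stat used by geom_flow() varies
make_plot <- function(flow_stat) {
  ggplot(lodes,
         aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
             y = freq, fill = value, label = value)) +
    scale_x_discrete(expand = c(.1, .1)) +
    geom_flow(stat = flow_stat) +
    geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
    geom_text(stat = "stratum", size = 4) +
    theme_void() +
    theme(axis.text.x = element_text(size = 12, face = "bold"))
}

p_flow     <- make_plot("flow")      # default stat, as in the question
p_alluvium <- make_plot("alluvium")  # alternate stat, as in this answer
```

Running print(p_flow) and then print(p_alluvium) should reproduce the question's plot and the plot from this answer, respectively.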

answered Dec 10 '21 at 17:58

Cory Brunson

This worked. Thank you! – Michael Caponegro Dec 15 '21 at 07:11
@Cory Brunson, will this code work for dataframe with missing values? Example in Python is here https://stackoverflow.com/questions/73512339/building-sankey-alluvial-plot-ignoring-nan-values – Anakin Skywalker Aug 27 '22 at 16:07
@AnakinSkywalker wow, i did not know that there's a Python package! Anyway, yes, missing values should pose no problem, though they can be handled in different ways. See [these examples](http://corybrunson.github.io/ggalluvial/reference/stat_alluvium.html). – Cory Brunson Aug 29 '22 at 09:35
@CoryBrunson, thanks, I will check! Added you on Linkedin too. If I nail it in R, thanks in advance. – Anakin Skywalker Aug 29 '22 at 15:59
@CoryBrunson, how about this https://stackoverflow.com/questions/73540146/transforming-data-with-nas-in-ggaluvial-format-and-visualizing-as-a-alluvial-plo – Anakin Skywalker Aug 30 '22 at 09:31
