
I am trying to use an alluvial plot (Sankey diagram) to show how different categories change between two time periods (pre and post). I can create a plot that makes sense when all the factor levels are represented in both time periods, but with my data the plot looks strange after I change the order of the factor levels. I would also like each category to have the same fill color in both time periods, but I could only change the fill for the first period (pre). In addition, the colors I specify do not end up on the factor levels I intended, even though the order of the boxes (strata) is correct.

Any help or advice on how to improve the plot, and on how to order the factor levels for both periods when the categories are not fully represented in both of them, would be very helpful.

Here is the code:

    db <- read.table(text = "pre    post    freq
NE  NE  0
NE  DD  2
NE  LC  5
NE  NT  2
NE  VU  3
NE  EN  5
NE  CR  1
DD  NE  0
DD  DD  3
DD  LC  37
DD  NT  10
DD  VU  14
DD  EN  3
DD  CR  3
LC  NE  0
LC  DD  0
LC  LC  18
LC  NT  2
LC  VU  1
LC  EN  2
LC  CR  0
NT  NE  0
NT  DD  1
NT  LC  3
NT  NT  8
NT  VU  13
NT  EN  5
NT  CR  1
VU  NE  0
VU  DD  0
VU  LC  1
VU  NT  0
VU  VU  7
VU  EN  8
VU  CR  3
EN  NE  0
EN  DD  0
EN  LC  0
EN  NT  0
EN  VU  0
EN  EN  0
EN  CR  2
CR  NE  0
CR  DD  0
CR  LC  1
CR  NT  0
CR  VU  0
CR  EN  0
CR  CR  2
", header=T)

head(db)


# Order factor levels
levels(db$pre) <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")
levels(db$post) <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")

# Set colors for the plot
colors.p <- c("#282828", "#7C7C7C", "#20AB5F", "#3EFF00", 
              "#FBFF00", "#FFBD00", "#FF0C00")

# Plot (geom_alluvium() and geom_stratum() come from ggalluvial)
library(ggplot2)
library(ggalluvial)

p <- ggplot(db,
            aes(y = freq, axis1 = pre, 
                axis2 = post)) +
  geom_alluvium(aes(fill = pre), show.legend = FALSE) +
  geom_stratum(aes(fill = pre), color = "black", alpha = 0.5) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("previous", "current"), 
                   expand = c(0.3, 0.01)) +
  scale_fill_manual(values = colors.p) +
  theme_void() +
  theme(
    panel.background = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 15, face = "bold"),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none"
  )

p  

user1626688

1 Answer


I gave it a shot with a different package that I am more familiar with (ggsankey). I also removed one category from each of the time points to show that the factor reordering still works when a category is missing from one of them. Does this solve your issues? If not, please clarify what you are still missing.

library(tidyverse)
library(ggsankey)

db <- data.frame(pre = rep(c("DD", "LC", "NT",
                             "VU", "EN", "CR"), each = 6),
                 post = rep(c("DD", "LC", "NT",
                              "VU", "EN", "CR"), times = 6),
                 freq = rep(sample(seq(0:20), 6), 6))
db %>% 
  uncount(freq) %>%
  filter(pre != "DD", post != "NT") %>%
  make_long(pre, post) %>%
  mutate(node = fct_relevel(node, "LC", "NT", "VU", "EN", "CR"), 
         next_node = fct_relevel(next_node, "DD", "LC", "VU", "EN", "CR")) %>%
  ggplot(aes(x = x, 
             next_x = next_x, 
             node = node, 
             next_node = next_node,
             fill = factor(node))) +
  geom_alluvial() +
  scale_fill_manual(values = c("DD" = "#7C7C7C", "LC" = "#20AB5F", "NT" = "#3EFF00", "VU" = "#FBFF00", "EN" = "#FFBD00", "CR" = "#FF0C00"))
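The named `values` vector in the call above is what keeps each color attached to its category. As far as I understand `scale_fill_manual()`, an unnamed vector is matched to the levels by position, while a named one is matched by name. A minimal base R sketch of the difference, reusing the category codes and colors from the question (the helper names `cats`, `named_colors` and `by_position` are made up for illustration):

```r
# Category order and colors taken from the question.
cats <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")
colors.p <- c("#282828", "#7C7C7C", "#20AB5F", "#3EFF00",
              "#FBFF00", "#FFBD00", "#FF0C00")

# Name each color so that lookups go by category, not by position.
named_colors <- setNames(colors.p, cats)

# factor() sorts its levels alphabetically unless told otherwise,
# so the default level order differs from `cats`.
default_levels <- levels(factor(cats))

# Positional matching pairs the first color with the first alphabetical level.
by_position <- setNames(colors.p, default_levels)

print(named_colors[["CR"]])  # the red intended for CR
print(by_position[["CR"]])   # the dark grey intended for NE
```

Passing something like `named_colors` as `values` should therefore give the same color per category no matter how the levels end up being ordered.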


EDIT: For your new data the previous approach I posted still works. You need to add the additional level ("NE") in the factor releveling for the pre timepoint and as a new color (blue in this example). What error do you get with this data?

library(tidyverse)
library(ggsankey)

db <- read.table(text = "pre    post    freq
NE  NE  0
NE  DD  2
NE  LC  5
NE  NT  2
NE  VU  3
NE  EN  5
NE  CR  1
DD  NE  0
DD  DD  3
DD  LC  37
DD  NT  10
DD  VU  14
DD  EN  3
DD  CR  3
LC  NE  0
LC  DD  0
LC  LC  18
LC  NT  2
LC  VU  1
LC  EN  2
LC  CR  0
NT  NE  0
NT  DD  1
NT  LC  3
NT  NT  8
NT  VU  13
NT  EN  5
NT  CR  1
VU  NE  0
VU  DD  0
VU  LC  1
VU  NT  0
VU  VU  7
VU  EN  8
VU  CR  3
EN  NE  0
EN  DD  0
EN  LC  0
EN  NT  0
EN  VU  0
EN  EN  0
EN  CR  2
CR  NE  0
CR  DD  0
CR  LC  1
CR  NT  0
CR  VU  0
CR  EN  0
CR  CR  2
", header=T)
db %>% 
  uncount(freq) %>%
  make_long(pre, post) %>%
  mutate(node = fct_relevel(node,"DD", "LC", "NT","NE", "VU", "EN", "CR"), 
         next_node = fct_relevel(next_node, "DD", "LC", "NT", "VU", "EN", "CR")) %>%
  ggplot(aes(x = x, 
             next_x = next_x, 
             node = node, 
             next_node = next_node,
             fill = factor(node))) +
  geom_alluvial() +
  scale_fill_manual(values = c("DD" = "#7C7C7C", "LC" = "#20AB5F", "NT" = "#3EFF00", "VU" = "#FBFF00", "EN" = "#FFBD00", "CR" = "#FF0C00", "NE" ="blue"))
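A side note on the `levels(db$pre) <- c(...)` lines in the question, which may explain the strange plot there: assigning to `levels()` does not reorder anything. On a factor it relabels the existing (alphabetical) levels, and on a character column, which is what `read.table()` returns by default since R 4.0, it only attaches an attribute. A small base R sketch with toy data (the names `x`, `wanted`, `renamed` and `reordered` are illustrative only):

```r
x <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")
wanted <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")

# Assigning to levels() relabels the existing alphabetical levels in place,
# so the underlying values silently change.
renamed <- factor(x)
levels(renamed) <- wanted

# Passing `levels` to factor() only changes the order; the values stay intact.
reordered <- factor(x, levels = wanted)

print(as.character(renamed))    # no longer matches x
print(as.character(reordered))  # identical to x
print(levels(reordered))        # in the wanted order
```

`fct_relevel()` in the code above is on the safe side here, since it moves levels around without relabelling them.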


Mojoesque
  • Thanks. It is working but still producing a warning message after I incorporate the missing level you removed and when I added an extra level that was not shown in my original post. (Warning message: Problem with `mutate()` input `next_node`. ℹ Unknown levels in `f`: NE ℹ Input `next_node` is `fct_relevel(next_node, "NE", "DD", "LC", "VU", "EN", "CR")`. ). I also would like to remove the legend but identify each level in the graph with a label and I am not too familiar with this package. – user1626688 Oct 27 '21 at 12:48
  • Depending on which levels are available at which timepoints, you have to adapt the `fct_relevel()` call to exactly the levels that occur in the node (pre) or the next_node (post). In my example I removed "DD" from the pre timepoint and "NT" from the post timepoint. Therefore these levels are also missing from the `fct_relevel()` call for the respective timepoints. Did you adapt the factor reordering according to the factor levels you added back / in addition? – Mojoesque Oct 27 '21 at 12:54
  • I just edited the code to reflect the problem I am having with my dataset. I am not sure if the problem is because some time period combinations are zero. When I tried using the tidyverse approach I kept getting an error too. – user1626688 Oct 27 '21 at 14:11
  • The same approach still works for your new data. I edited my answer. What error are you getting? – Mojoesque Oct 27 '21 at 14:25
  • I get this error and no plot yet: Error: `n()` must only be used inside dplyr verbs. – user1626688 Oct 27 '21 at 14:34
  • If you only run exactly the code I posted, does that work for you? I am not using the function n() anywhere. – Mojoesque Oct 27 '21 at 14:35
  • Ok, I figured out what the error was. – user1626688 Oct 27 '21 at 15:58
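The "Unknown levels in `f`: NE" warning discussed in the comments is consistent with the question's data: every row with post equal to "NE" has a frequency of 0, so "NE" never appears in the post column once the zero rows are dropped. One way to avoid hand-editing the level list per timepoint is to derive it from the data. A minimal base R sketch under that assumption, using a toy subset of the flows (the names `flows`, `kept`, `pre_levels` and `post_levels` are hypothetical):

```r
# Full category order wanted in the plot.
wanted <- c("NE", "DD", "LC", "NT", "VU", "EN", "CR")

# Toy subset of the flows; the last row has freq 0 and therefore
# contributes nothing once the data is expanded with uncount().
flows <- data.frame(
  pre  = c("NE", "DD", "LC", "NE"),
  post = c("DD", "LC", "LC", "NE"),
  freq = c(2, 37, 18, 0)
)

# Keep only the flows that will actually be drawn.
kept <- flows[flows$freq > 0, ]

# intersect() keeps the order of its first argument, so the result is
# the wanted order restricted to the categories present in each column.
pre_levels  <- intersect(wanted, kept$pre)
post_levels <- intersect(wanted, kept$post)

kept$pre  <- factor(kept$pre,  levels = pre_levels)
kept$post <- factor(kept$post, levels = post_levels)

print(pre_levels)
print(post_levels)
```

The same per-column level vectors could then serve as the level lists for the releveling step, so that no level is requested that the column does not contain.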