
Questions

I'm trying to visualize panel data on individuals that includes both a discrete or categorical choice and a continuous choice in each time period. One common example of this situation is customers purchasing a product/subscription and then choosing how frequently to use the product/service.

I would like to show "flows" across time periods weighted by the continuous variable in each time period: some sort of cross between a weighted stacked bar chart and a Sankey or alluvial diagram. Sankey and alluvial diagrams fundamentally represent flows between nodes, where each flow has a single magnitude. Instead, I would like to show "flows" representing a continuous choice that might have different values in different time periods, even for the same individual. The resulting diagram would look very similar to a Sankey or alluvial plot, except that the alluvia or "flows" would gradually change width between time periods. For example, suppose a customer buys the same subscription in two time periods but uses it more frequently in the second period; that usage could be represented by a band or "flow" that increases in width from the first to the second time period.

  1. Does this chart type already exist anywhere? I was unable to find any examples in a fairly extensive search. If it doesn't exist, I hope that the value of such a chart type is clear and that someone will name and create it! :)
  2. How might such a graph be "hacked" in R using existing alluvial or Sankey libraries? I imagine this is not trivial, since those chart types are defined by constant flows between nodes.

Example in R

I'll walk through an example using R to clarify the problem. Here's an example data set:

library(tidyr)
library(dplyr)
library(alluvial)
library(ggplot2)
library(forcats)

set.seed(42)
individual <- rep(LETTERS[1:10],each=2)
timeperiod <- paste0("time_",rep(1:2,10))
discretechoice <- factor(paste0("choice_",sample(letters[1:3],20, replace=TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

I can visualize panel data for the discrete or categorical choice piece perfectly well. A stacked bar chart can be used to show how the number of individuals in each category changes over time. Alluvial or Sankey diagrams can additionally show the individual movements that are causing changes in the category totals. For example:

# stacked bar diagram of discrete choice by individual
g <- ggplot(data=d,aes(timeperiod,fill=fct_rev(discretechoice)))
g + geom_bar(position="stack") + guides(fill=guide_legend(title=NULL))


# alluvial diagram of discrete choice by individual
d_alluvial <- d %>%
  select(individual,timeperiod,discretechoice) %>%
  spread(timeperiod,discretechoice) %>%
  group_by(time_1,time_2) %>%
  summarize(count=n()) %>%
  ungroup()
alluvial(select(d_alluvial,-count),freq=d_alluvial$count)

[Figure: stacked bar and alluvial diagrams]

I can also look at the continuous choice totals by category and across time periods by weighting the stacked bar chart.

# stacked bar diagram of discrete choice, weighting by continuous choice
g + geom_bar(position="stack",aes(weight=continuouschoice))

[Figure: weighted stacked bar]

However, I cannot add any kind of individual "flows" across time periods to this weighted stacked bar chart. Each "flow" would have a different width in time period 1 than in time period 2, so it would need to change width gradually between the time periods. Sankey and alluvial diagrams, by contrast, have a single magnitude or width for each flow.
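To make the varying widths concrete, here is a minimal sketch that tabulates what each individual's band would need: the category and the width in each period, one row per individual. It rebuilds the example data `d` so that it runs on its own; the names `d_widths` and `width_change` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data so this block is self-contained
set.seed(42)
d <- data.frame(
  individual = rep(LETTERS[1:10], each = 2),
  timeperiod = paste0("time_", rep(1:2, 10)),
  discretechoice = factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE))),
  continuouschoice = ceiling(runif(20, 0, 100))
)

# One row per individual: category and band width in each time period
d_widths <- d %>%
  pivot_wider(
    id_cols = individual,
    names_from = timeperiod,
    values_from = c(discretechoice, continuouschoice)
  ) %>%
  # A nonzero change means the band must widen or narrow between periods
  mutate(width_change = continuouschoice_time_2 - continuouschoice_time_1)

d_widths
```

Any row with a nonzero `width_change` is a band that a constant-width flow cannot represent.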

eipi10
Stuntz

1 Answer


I faced just this sort of confusion when I began adapting the alluvial package to the ggplot2 framework. It's not uncommon for Sankey and alluvial diagrams to change weight from position to position, but alluvial was not built to handle data in a format suitable for encoding that. (Edit: The alluvial_ts() function in alluvial was built for this, and there is an example in the README, but it doesn't produce stacked histograms at each time period.)

One option may be to use the parallel set geoms in the development version of ggforce, though I'm not familiar with them myself. The other option I'm aware of is my own package, ggalluvial. Here's one solution to your problem, I think, using your dataset d (notice that the colors differ):

library(ggalluvial)
ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  geom_stratum(aes(fill = discretechoice)) +
  geom_flow()

[Figure: alluvial diagram in ggplot2]

It's also possible to color the flows between the time periods; see the examples.
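As a rough sketch of what colored flows might look like, the variant below maps `fill` inside `geom_flow()` as well. It rebuilds `d` so that it runs on its own. As far as I understand the defaults, each flow takes the fill of the stratum it leaves, but treat that as an assumption and check the package documentation.

```r
library(ggplot2)
library(ggalluvial)

# Rebuild the example data so this block is self-contained
set.seed(42)
d <- data.frame(
  individual = rep(LETTERS[1:10], each = 2),
  timeperiod = paste0("time_", rep(1:2, 10)),
  discretechoice = factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE))),
  continuouschoice = ceiling(runif(20, 0, 100))
)

p <- ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  # Draw the flows first so the strata sit on top of them
  geom_flow(aes(fill = discretechoice)) +
  geom_stratum(aes(fill = discretechoice))

p
```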

I couldn't find a good discussion of the differences between the two data formats, i.e. one row per subject across all time periods versus one row per subject per time period, so I tried to write one in the vignette. If you have any suggestions, I'd be glad to hear them!
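For readers who haven't met the two layouts, here is a small sketch using tidyr rather than ggalluvial's own helpers. The toy data frame `long` and the names `wide` and `long_again` are invented for illustration.

```r
library(tidyr)

# Long format: one row per subject per time period (the layout of `d`)
long <- data.frame(
  individual = rep(c("A", "B", "C"), each = 2),
  timeperiod = rep(c("time_1", "time_2"), times = 3),
  discretechoice = c("choice_a", "choice_b", "choice_a",
                     "choice_a", "choice_c", "choice_b")
)

# Wide format: one row per subject, one column per time period
wide <- pivot_wider(long, names_from = timeperiod, values_from = discretechoice)

# And back again to one row per subject per time period
long_again <- pivot_longer(
  wide,
  cols = c(time_1, time_2),
  names_to = "timeperiod",
  values_to = "discretechoice"
)
```

The long layout has a natural place for a per-period value such as continuouschoice, which is why it suits flows whose width changes from one period to the next.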

Cory Brunson
  • I haven't returned to this problem since your reply, but this is exactly the functionality I was looking for. The example you gave doesn't work any more though, since the weight aesthetic has been deprecated in ggalluvial. Simply replacing weight with y in your code threw an error for me: "Warning message: Computation failed in `stat_flow()`: unknown variable to group by : group_cols". If you edit your answer I'll accept it. Thanks so much for the response! – Stuntz Jul 14 '18 at 13:25
  • @Stuntz thanks for confirming your need and sorry to hear about the error. While the `weight` parameter is deprecated, internally it's converted to `y`, so the plot should still render as coded above. I just re-installed from CRAN and ran the full set of code above and was able to reproduce the diagram. It sounds like ggalluvial may require more recent versions of some tidyverse packages, in which case I need to add that info to my documentation. I'll experiment a bit to see if I can figure that out. – Cory Brunson Jul 15 '18 at 14:53
  • @CoryBrunson, you guessed right -- I updated tidyverse and was able to reproduce the plot. Thanks again! – Stuntz Jul 17 '18 at 00:54
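
If you hit the same error, one quick check is to list the installed versions of the packages involved before updating. This is a minimal base-R sketch; the package list in `pkgs` is my guess at the relevant ones, not something confirmed in the thread.

```r
# Packages that may be involved (an assumption; adjust as needed)
pkgs <- c("ggalluvial", "ggplot2", "dplyr", "tidyr")

# Keep only those that are actually installed
installed <- pkgs[pkgs %in% rownames(installed.packages())]

# Report the installed version of each
versions <- vapply(
  installed,
  function(p) as.character(packageVersion(p)),
  character(1)
)
versions

# install.packages(pkgs)  # uncomment to reinstall the latest CRAN releases
```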