I spent most of yesterday on the following problem and haven't found a solution yet:

I have a data frame with categorical data: say category1 has values A and B; another column, category2, has values C, D, F, G; category3 has value H, and so on.

I want to make a Sankey diagram showing, through the widths of the bands from node to node, how many rows from category1 A fall into C, D, F, and G, and likewise for all other combinations in the grouped data frame.

It's basically a tree with the width of the branches showing how many counts are in the particular branch.

Is there a flexible way to do this, so that it works for most groupings in categorical data frames?

DCB

2 Answers

You can try with the nice ggalluvial package:

library(ggalluvial)
library(ggplot2)

# some fake data
data <- data.frame(column1 = c('A', 'A', 'A', 'B', 'B', 'B'),
                   column2 = c('C', 'D', 'E', 'C', 'D', 'E'),
                   column3 = c('F', 'G', 'H', 'I', 'J', 'K'))

# add a constant as the frequency: if each "flow" counts as 1, you can do this
data$freq <- 1

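# here is the plot
ggplot(data,
       aes(y = freq, axis1 = column1, axis2 = column2, axis3 = column3)) +
  # colour each flow by its column1 value so the fill scale below applies
  geom_alluvium(aes(fill = column1), width = 1/12) +
  geom_stratum(width = 1/12, fill = "black", color = "blue") +
  # label.strata is deprecated in recent ggalluvial; map the label instead
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  ggtitle("nice sankey")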
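
If your real data has one row per observation rather than one row per flow, you may prefer to count the combinations first and use that count as the frequency. A minimal base R sketch, assuming a hypothetical data frame `raw` whose columns are all categorical (the column names and values here are made up for illustration):

```r
# Hypothetical raw data: one row per observation, categorical columns only
raw <- data.frame(
  category1 = c("A", "A", "A", "B", "B"),
  category2 = c("C", "C", "D", "C", "D"),
  category3 = c("H", "H", "H", "I", "I"),
  stringsAsFactors = FALSE
)

# Cross-tabulate all columns, then flatten to one row per combination
counts <- as.data.frame(table(raw), responseName = "freq",
                        stringsAsFactors = FALSE)

# table() also lists combinations that never occur; drop those
counts <- counts[counts$freq > 0, ]
```

The resulting `counts` has the same shape as the fake data above, so it should be usable with `aes(y = freq, axis1 = category1, axis2 = category2, axis3 = category3)`.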


s__

If you're willing to rearrange your data into a node list and an edge list, you can take advantage of the D3 JavaScript library through the networkD3 package. Here's an example with dummy data. Note that to use this library the edges must refer to nodes by a zero-based id, so the id column has to start at 0.

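library(tidyverse)
library(networkD3)

# ten nodes; the edges refer to them by zero-based id
nodes <- tibble(id = 0:9, label = as.character(1:10))

# dummy edges: every id exists in nodes, there are no cycles,
# and the weights are positive (band widths cannot be negative)
set.seed(1)
edges <- tibble(from = c(0, 0, 1, 1, 2, 3, 4, 5, 6),
                to = c(2, 3, 3, 4, 5, 6, 7, 8, 9),
                weight = runif(9, min = 1, max = 10))

sankeyNetwork(Links = edges,
              Nodes = nodes,
              Source = "from",
              Target = "to",
              NodeID = "label",
              Value = "weight")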
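
To get from a categorical data frame to the node and edge lists, one option is to count the flows between each pair of adjacent columns. The sketch below is only an illustration: `make_sankey_data` is a hypothetical helper, not part of networkD3, and it assumes every label is unique across columns (a label shared by two columns would be merged into a single node).

```r
# Hypothetical categorical data, one row per observation
df <- data.frame(
  category1 = c("A", "A", "A", "B", "B"),
  category2 = c("C", "C", "D", "C", "D"),
  category3 = c("H", "H", "H", "I", "I"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build node and link tables from adjacent columns
make_sankey_data <- function(df) {
  # Count how often each value of column i is followed by each value of i + 1
  pairs <- lapply(seq_len(ncol(df) - 1), function(i) {
    tab <- as.data.frame(table(source = df[[i]], target = df[[i + 1]]),
                         stringsAsFactors = FALSE)
    tab[tab$Freq > 0, ]  # keep only flows that actually occur
  })
  links <- do.call(rbind, pairs)
  names(links)[names(links) == "Freq"] <- "value"

  # One node per distinct label (assumes labels are unique across columns)
  nodes <- data.frame(name = unique(c(links$source, links$target)),
                      stringsAsFactors = FALSE)

  # Replace labels by zero-based positions in the node table
  links$source <- match(links$source, nodes$name) - 1
  links$target <- match(links$target, nodes$name) - 1
  list(nodes = nodes, links = links)
}

sk <- make_sankey_data(df)

# Draw the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  networkD3::sankeyNetwork(Links = sk$links, Nodes = sk$nodes,
                           Source = "source", Target = "target",
                           Value = "value", NodeID = "name")
}
```

Because the helper only looks at neighbouring columns, it should work for any number of grouping columns, in whatever order they appear in the data frame.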
Ben G