Creating Sankey diagram in R; making the plot output interpretable

Question

I am creating Sankey diagrams in R for the first time, to show the connections between antecedent and consequent events and the number of times each connection occurs. Here is a mock example of the type of data that I am working with:

#df creation=====================================================

df<-structure(list(Antecedent = c("Activity 1", "Activity 1", "Activity 1", 
                                  "Activity 1", "Activity 1", "Activity 2", "Activity 2", "Activity 2", 
                                  "Activity 2", "Activity 2", "Activity 3", "Activity 3", "Activity 3", 
                                  "Activity 3", "Activity 3", "Activity 4", "Activity 4", "Activity 4", 
                                  "Activity 4", "Activity 4", "Activity 5", "Activity 5", "Activity 5", 
                                  "Activity 5", "Activity 5"), 
                   Consequent = c("Activity 1", "Activity 2", 
                   "Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2", 
                   "Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2", 
                   "Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2", 
                   "Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2", 
                   "Activity 3", "Activity 4", "Activity 5"), 
                   count = c(1694888L,170L, 4060L, 0L, 7L, 255L, 46564L, 756L, 38L, 43L, 3926L, 523L, 
                                      303979L, 689L, 711L, 0L, 51L, 670L, 35210L, 383L, 13L, 59L, 800L, 
                                      508L, 14246L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
                                      -25L))

Here is the code that I am using to wrangle the data to make it amenable to the Sankey diagram function in the networkD3 library.

#libraries========================================
library(dplyr)
library(networkD3)


# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name=c(as.character(df$Antecedent),
         as.character(df$Consequent)) %>% unique()
)



# With networkD3, connections must be given as zero-based node indices, not as the node names used in the links data frame, so we need to reformat them.
df$IDsource <- match(df$Antecedent, nodes$name)-1
df$IDtarget <- match(df$Consequent, nodes$name)-1



# Make the Network
p <- sankeyNetwork(Links = df, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "count", NodeID = "name",units = "%")
p

But it gives me a plot that looks terrible and is almost uninterpretable.

I was hoping to get something like the example in the link below (which is where I found the code):

Most basic Sankey Diagram

And I still want to achieve this kind of output. I think the most obvious issue is the naming conventions of my Antecedent and Consequent variables within my df as they are the same.

But I was wondering if there was still a way (without changing the naming convention within my df) to create a Sankey diagram similar to those in the link I had attached above. Can someone kindly provide a solution? Many thanks!

score 2 · Answer 1 · answered Jan 18 '22 at 12:43

If you want to stick with networkD3, I think you’ll need to disambiguate the node names in order to avoid the loops in the resulting graph.
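To see where the loops come from, here is a minimal sketch on a cut-down, hypothetical version of the data (three activities instead of five; the names `mini`, `shared_*` and `split_*` are made up for illustration). It only counts nodes and self-loops in base R and does not draw anything.

```r
# Hypothetical miniature of the question's data:
# every activity is paired with every activity, including itself.
acts <- paste("Activity", 1:3)
mini <- expand.grid(Antecedent = acts, Consequent = acts,
                    stringsAsFactors = FALSE)

# Shared names: both columns map onto the same three nodes,
# so a link such as "Activity 1" -> "Activity 1" starts and ends on one node.
shared_nodes <- union(mini$Antecedent, mini$Consequent)
shared_src <- match(mini$Antecedent, shared_nodes) - 1
shared_tgt <- match(mini$Consequent, shared_nodes) - 1
n_self_loops_shared <- sum(shared_src == shared_tgt)

# Prefixed names: antecedents and consequents become separate nodes,
# so every link runs from the left-hand set to the right-hand set.
ante <- paste("Antecedent", mini$Antecedent)
cons <- paste("Consequent", mini$Consequent)
split_nodes <- union(ante, cons)
split_src <- match(ante, split_nodes) - 1
split_tgt <- match(cons, split_nodes) - 1
n_self_loops_split <- sum(split_src == split_tgt)

cat("shared names:  ", length(shared_nodes), "nodes,",
    n_self_loops_shared, "self-loops\n")
cat("prefixed names:", length(split_nodes), "nodes,",
    n_self_loops_split, "self-loops\n")
```

With shared names, links also run in both directions between the same pair of nodes, which a left-to-right Sankey layout cannot place cleanly. The prefixes remove both problems at once.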

library(dplyr)
library(networkD3)

# Disambiguate node names
links <- df %>% 
  mutate(
    Antecedent = paste("Antecedent", Antecedent),
    Consequent = paste("Consequent", Consequent),
  )

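# Create a data frame for nodes
# (built directly rather than with summarise(), which is deprecated
# for results longer than one row in dplyr >= 1.1.0)
nodes <- tibble(name = union(links$Antecedent, links$Consequent))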

# Find node IDs for links
links$IDsource <- match(links$Antecedent, nodes$name) - 1
links$IDtarget <- match(links$Consequent, nodes$name) - 1

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "IDsource",
  Target = "IDtarget",
  Value = "count",
  NodeID = "name"
) -> p
#> Links is a tbl_df. Converting to a plain data frame.
#> Nodes is a tbl_df. Converting to a plain data frame.
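If the prefixes are unwanted in the plot itself, one option is to keep the prefixed names for matching and point `NodeID` at a separate display column. The following is a minimal sketch on a hypothetical two-by-two example; the `label` column is my own addition, not something networkD3 requires, and the plotting call is skipped if networkD3 is not installed.

```r
# Hypothetical two-by-two example; column names follow the code above.
links <- data.frame(
  Antecedent = paste("Antecedent Activity", c(1, 1, 2, 2)),
  Consequent = paste("Consequent Activity", c(1, 2, 1, 2)),
  count = c(10, 2, 3, 8)
)
nodes <- data.frame(name = union(links$Antecedent, links$Consequent))

# Zero-based node indices, matched on the prefixed (unique) names.
links$IDsource <- match(links$Antecedent, nodes$name) - 1
links$IDtarget <- match(links$Consequent, nodes$name) - 1

# Display label (assumed column name): the node name without its prefix.
nodes$label <- sub("^(Antecedent|Consequent) ", "", nodes$name)

if (requireNamespace("networkD3", quietly = TRUE)) {
  # Links refer to nodes by index, so repeated labels are not a problem.
  p <- networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "IDsource", Target = "IDtarget",
    Value = "count", NodeID = "label"
  )
  # Evaluate `p` at the console to render the widget.
}
```

Because the links are resolved by position in `nodes`, the data frame `df` and its naming convention stay untouched; only the intermediate `links` and `nodes` tables carry the prefixes.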

Alternatively, you could use ggplot2 with ggforce to create a static graph. It also requires some pre-processing to get the data in the right format:

library(ggplot2)

df %>% 
  ggforce::gather_set_data(1:2) %>% 
  ggplot(aes(x, split = y, id = id, value = count)) +
    ggforce::geom_parallel_sets(aes(fill = Antecedent)) +
    ggforce::geom_parallel_sets_axes(axis.width = 0.05) +
    ggforce::geom_parallel_sets_labels(
      angle = 0,
      hjust = 0,
      position = position_nudge(0.05)
    )
