I am creating Sankey diagrams for the first time in R, showing the connections between antecedent and consequent events and the number of times that they occur. Here is a mock example of the type of data that I am working with:-
#df creation=====================================================
df<-structure(list(Antecedent = c("Activity 1", "Activity 1", "Activity 1",
"Activity 1", "Activity 1", "Activity 2", "Activity 2", "Activity 2",
"Activity 2", "Activity 2", "Activity 3", "Activity 3", "Activity 3",
"Activity 3", "Activity 3", "Activity 4", "Activity 4", "Activity 4",
"Activity 4", "Activity 4", "Activity 5", "Activity 5", "Activity 5",
"Activity 5", "Activity 5"),
Consequent = c("Activity 1", "Activity 2",
"Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2",
"Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2",
"Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2",
"Activity 3", "Activity 4", "Activity 5", "Activity 1", "Activity 2",
"Activity 3", "Activity 4", "Activity 5"),
count = c(1694888L,170L, 4060L, 0L, 7L, 255L, 46564L, 756L, 38L, 43L, 3926L, 523L,
303979L, 689L, 711L, 0L, 51L, 670L, 35210L, 383L, 13L, 59L, 800L,
508L, 14246L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-25L))
Here is the code that I am using to wrangle the data to make it amenable to the Sankey diagram function in the networkD3 library.
#libraries========================================
library(dplyr)
library(networkD3)
# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
name=c(as.character(df$Antecedent),
as.character(df$Consequent)) %>% unique()
)
# With networkD3, connections must be given as zero-based node IDs, not as the real names used in the links data frame, so we need to reformat it.
df$IDsource <- match(df$Antecedent, nodes$name)-1
df$IDtarget <- match(df$Consequent, nodes$name)-1
# Make the Network
p <- sankeyNetwork(Links = df, Nodes = nodes,
Source = "IDsource", Target = "IDtarget",
Value = "count", NodeID = "name",units = "%")
p
But it gives me a plot that looks terrible and is almost uninterpretable:-
I was hoping that I would get something like the example in the link below (which is where I had found the code):-
And I still want to achieve this kind of output. I think the most obvious issue is the naming convention of my Antecedent and Consequent variables within my df, as they are the same. But I was wondering if there is still a way (without changing the naming convention within my df) to create a Sankey diagram similar to those in the link I had attached above. Can someone kindly provide a solution? Many thanks!
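To make my suspicion about the shared names concrete, here is a minimal sketch of one direction I have been considering. It is only a sketch, and it rests on my assumption that the shared names are the cause. Because Antecedent and Consequent use the same labels, match() gives both sides the same node IDs, so a link such as Activity 1 to Activity 1 points back at its own node. The sketch keeps the labels in the data untouched but builds a separate block of nodes for the consequents by offsetting their IDs. The names links, src_names, tgt_names, nodes2 and p2 are hypothetical and only used for illustration, and the data is a 2x2 subset of the mock example above.

```r
# Small subset of the mock data (labels are left unchanged)
links <- data.frame(
  Antecedent = c("Activity 1", "Activity 1", "Activity 2", "Activity 2"),
  Consequent = c("Activity 1", "Activity 2", "Activity 1", "Activity 2"),
  count      = c(1694888L, 170L, 255L, 46564L),
  stringsAsFactors = FALSE
)

# Keep a separate node list for each side; the displayed names may repeat
src_names <- unique(links$Antecedent)
tgt_names <- unique(links$Consequent)
nodes2 <- data.frame(name = c(src_names, tgt_names), stringsAsFactors = FALSE)

# Zero-based IDs: sources take 0..(n_src - 1), targets are shifted by n_src
links$IDsource <- match(links$Antecedent, src_names) - 1
links$IDtarget <- match(links$Consequent, tgt_names) - 1 + length(src_names)

# Check that no link has the same node as both source and target
has_self_links <- any(links$IDsource == links$IDtarget)

# Draw the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  p2 <- networkD3::sankeyNetwork(Links = links, Nodes = nodes2,
                                 Source = "IDsource", Target = "IDtarget",
                                 Value = "count", NodeID = "name")
}
```

With the IDs offset this way, every antecedent node sits in one block and every consequent node in another, so I would expect the flows to run left to right instead of looping, but I have not confirmed that this is the recommended approach.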