I am trying to create a sankey diagram, similar in concept to that seen here. The diagram I hope to create will likely have more intermediate nodes than the example provided.
I have been trying to use the networkd3 package, the sankeyNetwork
function specifically. My difficulty is getting the data I have into the correct structure for use with sankeyNetwork
.
My data is survey data where respondents are asked to rank metrics from most important to least important. For example,
W X Y Z
[1,] "Rank 1" "Rank 2" "Rank 3" "Rank 4"
[2,] "Rank 2" "Rank 3" "Rank 1" "Rank 4"
[3,] "Rank 1" "Rank 2" "Rank 3" "Rank 4"
[4,] "Rank 1" "Rank 2" "Rank 4" "Rank 3"
where W, X, Y & Z are the metrics.
To create a sankey, I need the data to be of the form:
0 1 10
0 2 5
1 3 2
The first column represents the start node (numbering starts from 0). The second column is the end node. The third column is the value/weight of the link connecting the nodes. There will also be a vector containing node names.
My ultimate goal is to have a sankey (travelling left to right) the first column nodes representing the metrics with their proportion of "Rank 1" votes received. Column 2 would again contain all the metrics but the links representing the proportion of "Rank 2" votes and so on until the last column contains the proportions of last place votes each metric received.
I am looking for a way to automate the data transformation (for want of a better word) as the data set I should have will have 7 metrics (thus 7 ranking positions) and responses from 50-100 people and thus there are many possible combinations of rankings.
At the moment I can use something similar to
example_data %>%
filter(W == "Rank 1" && X == "Rank 2") %>%
tally()
to provide the count but this requires me to write it out or loop through, every single possible combination of metrics and rankings . This in reality isn't really feasible for the size of data I am proposing to work with.
Edit: Thank you for feedback CJ Yetman. I had managed to already solve the issue and thus did not need to implement your answer but your solution is potentially a bit simpler than what I ended up doing.
I created a sankey_data data set containing the original data so that I would have a copy of the data to workwith.
sankey_data[["id"]] <- seq(1, nrow(sankey_data))
sankey_data <- sankey_data %>%
select(id, everything())
sankey_data <- apply(sankey_data, 2, as.character)
# Not necessarily required but I needed to convert the data points from factors
# to characters.
# Creating new variables to store data in more helpful format
sankey_data$Rank1 <- rep(NA, nrow(sankey_data))
sankey_data$Rank2 <- rep(NA, nrow(sankey_data))
sankey_data$Rank3 <- rep(NA, nrow(sankey_data))
sankey_data$Rank4 <- rep(NA, nrow(sankey_data))
# Filling in those new variables
ranking_levels <- c("Rank 1", "Rank 2", "Rank 3", "Rank 4")
for (i in 1:nrow(sankey_data)) {
for (j in 1:length(ranking_levels)) {
hold <- colnames(sankey_data[i, grep(sankey_data[i,],
pattern = paste0("^", ranking_levels[j]), fixed = F)])
sankey_data[i, 8 + j] <- hold
}
}
# Creating the Link data
Link1 <- sankey_data %>%
plyr::count(vars = c("Rank1", "Rank2")) %>%
mutate("link" = 1)
Link2 <- sankey_data %>%
plyr::count(vars = c("Rank2", "Rank3")) %>%
mutate("link" = 2)
Link3 <- sankey_data %>%
plyr::count(vars = c("Rank3", "Rank4")) %>%
mutate("link" = 3)
# I then added prefixes to each data point within links 1 - 3 respectively.
# I just used paste0 but won't include the detail here as this is additional to
# what is strictly necessary to create the Sankey.
# Adding column names
colnames(Link1) <- c("source", "target", "value", "link")
colnames(Link2) <- colnames(Link1)
colnames(Link3) <- colnames(Link1)
# Combing into a single data set
links <- rbind(Link1, Link2, Link3)
nodes <- data.frame(name = c(as.character(links[["source"]]),
as.character(links[["target"]])) %>% unique())
# As sankeyNetwork requires the nodes to be in numeric form (starting from 0),
# this serevs to convert the node names to numbers for input into the function
links[["IDsource"]] <- match(links[["source"]], nodes[["name"]]) - 1
links[["IDtarget"]] <- match(links[["target"]], nodes[["name"]]) - 1
# The Sankey
sankeyNetwork(Links = links,
Nodes = nodes,
Source = "IDsource",
Target = "IDtarget",
Value = "value",
fontFamily = "Arial",
NodeID = "name",
sinksRight = FALSE, fontSize = 24, height = 1400, width = 3200)
This code worked for me. I have tried to adapt it to work with the example data as I can't publish the actual data so there might be one or two artifacts that I have missed and don't make sense. Just let me know if that is the case and I will try to update it.