Recently, I have been learning about Sankey Diagrams and have been trying to recreate them in R (https://rpubs.com/dmormandy/DV_Sankey).
Here is my imaginary problem: I have a list of people, each with a first name, middle name and last name. The names are common, so they repeat. I am trying to make a 3-layer Sankey Diagram (one layer per name) that shows how people become more uniquely identified as you add the middle name and then the last name. E.g. there are many people called "John", but there is only one "John Claude Frank".
Here is the fake data I created:
library(dplyr)
name_data <- data.frame(
"City" = c("Paris", "Paris", "Paris", "Paris", "Paris", "London", "London", "London", "Paris", "London", "Paris"),
"First_Name" = c("John", "John", "John", "John", "John", "John", "James", "James", "Adam", "Adam", "Henry"),
"Middle_Name" = c("Claude", "Claude", "Claude", "Smith", "Smith", "Peters", "Stevens", "Stevens", "Ford", "Tom", "Frank"),
"Last Name " = c("Tony", "Tony", "Frank", "Carson", "Phil", "Lewis", "Eric", "David", "Roberts", "Scott", "Xavier")
)
From here, I understand that you need to create a "links" object and a "nodes" object, which together specify the relationships between the names. For the time being, I tried to create an additional column (in Microsoft Excel) that counts how many times each name appears in a given column and places that count beside the name. But I don't think this is the right way to solve this problem.
I tried to accomplish this in R, but I don't think I am doing it right:
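# Note: data.frame() turns the column name "Last Name " into Last.Name.
dats_all <- name_data %>%                                # data
  group_by(First_Name, Middle_Name, Last.Name.) %>%      # group by full name
  summarise(Freq = n(), .groups = "drop")                # count people per full name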
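For reference, here is a minimal base R sketch of what I understand the "links" and "nodes" objects to look like. It rests on a few assumptions of mine: the helper `make_links()` is hypothetical (not from any package), each node label is prefixed with its layer name so that "Frank" as a middle name and "Frank" as a last name stay separate, and the indices are made zero-based because I believe that is what `networkD3::sankeyNetwork()` expects.

```r
# Same fake names as above, with the column names R actually ends up using
name_data <- data.frame(
  First_Name  = c("John", "John", "John", "John", "John", "John", "James", "James", "Adam", "Adam", "Henry"),
  Middle_Name = c("Claude", "Claude", "Claude", "Smith", "Smith", "Peters", "Stevens", "Stevens", "Ford", "Tom", "Frank"),
  Last.Name.  = c("Tony", "Tony", "Frank", "Carson", "Phil", "Lewis", "Eric", "David", "Roberts", "Scott", "Xavier"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how many people flow between two adjacent layers.
# Labels are prefixed with the layer name to keep the layers distinct.
make_links <- function(df, from, to) {
  pairs <- data.frame(
    source = paste(from, df[[from]], sep = ": "),
    target = paste(to, df[[to]], sep = ": "),
    value  = 1,
    stringsAsFactors = FALSE
  )
  aggregate(value ~ source + target, data = pairs, FUN = sum)
}

# One set of links per pair of adjacent layers
links <- rbind(
  make_links(name_data, "First_Name", "Middle_Name"),
  make_links(name_data, "Middle_Name", "Last.Name.")
)

# Every distinct label that appears in a link becomes a node
nodes <- data.frame(
  name = unique(c(as.character(links$source), as.character(links$target))),
  stringsAsFactors = FALSE
)

# Zero-based positions of each link's endpoints in the nodes table
links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1
```

If this is the right shape, `links$value` would be the number of people on each flow (for example, the link from "First_Name: John" to "Middle_Name: Claude"), and `links` and `nodes` could then be handed to a Sankey plotting function.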
I also found the ggforce documentation for geom_parallel_sets and am trying to recreate its example with my data: https://ggforce.data-imaginist.com/reference/geom_parallel_sets.html
library(ggforce)
library(reshape2)
name_data <- data.frame(
"City" = c("Paris", "Paris", "Paris", "Paris", "Paris", "London", "London", "London", "Paris", "London", "Paris"),
"First_Name" = c("John", "John", "John", "John", "John", "John", "James", "James", "Adam", "Adam", "Henry"),
"Middle_Name" = c("Claude", "Claude", "Claude", "Smith", "Smith", "Peters", "Stevens", "Stevens", "Ford", "Tom", "Frank"),
"Last Name " = c("Tony", "Tony", "Frank", "Carson", "Phil", "Lewis", "Eric", "David", "Roberts", "Scott", "Xavier")
)
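name_data$ID <- seq.int(nrow(name_data))
name_data$value <- 1  # each row is one person
# gather_set_data() needs to be told which columns form the axes;
# it stacks them into long format with new x (axis) and y (name) columns
data <- gather_set_data(name_data, c("First_Name", "Middle_Name", "Last.Name."))
# Keep the axes in first / middle / last order (assumes x holds the column names)
data$x <- factor(data$x, levels = c("First_Name", "Middle_Name", "Last.Name."))
ggplot(data, aes(x, id = ID, split = y, value = value)) +
  geom_parallel_sets(alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.1) +
  geom_parallel_sets_labels(colour = 'white')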
I think I am getting closer, but there are still some problems. Could someone please show me how to make a basic Sankey Diagram in R (preferably one that shows the number of people at each level, something like this: https://i.redd.it/lrjdj45xrpo21.png)?