I am working on a clickstream dataset that contains users, the pages they visited, and the path number (1 = starting page, 2 = the next page they visited, etc.). I am trying to visualize user paths and thought a Sankey diagram would be best, but I am at a loss as to how to convert the dataset into one. Below is what my dataset looks like:

UserID Page_Visited Path_ID
123 Pg1 1
123 Pg2 2
123 Pg3 3
456 Pg1 1
456 Pg5 2

All I want to show is the cumulative path: x users started at Pg1, then went to Pg2, Pg5, or another page. Something like a Sankey diagram.

1 > 2 > 3 > ...

I created a frequency dataset that looks like this, where id = path number (1 = the starting page), Page = the page visited, and freq = the count of users on that page at that path number:

id Page freq
1 create_message 1
1 home 153
1 about 97
2 create_message 10
2 pricing 21
2 home 155
2 contact 2
2 services 31
3 home 22
3 pricing 44
3 about 11

I would really appreciate some help here. How should I restructure my data, or what code could get me going? (I tried the networkD3 package, but I think I am using it incorrectly.) If you think I should be using a different visualization instead of a Sankey diagram, I am open to trying that too. Thank you.

user2845095

1 Answer

I think that the package "ggsankey" can be very helpful in your case.

In the following code, I simulate a dataset where columns represent the order of the pages visited (from 1st page to 4th page) and each observation represents the pages visited by an individual (here, I simulate 10 individuals).

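# ggsankey is not on CRAN; install it from GitHub if needed:
# remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)
library(ggplot2)
library(dplyr)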

set.seed(1)  # make the simulated paths reproducible

# One row per user, one column per step in the path
df <- data.frame("id" = 1:10,
                 "first_page" = sample(x = c("home"), size = 10, replace = TRUE),
                 "second_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE),
                 "third_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE),
                 "fourth_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE)
)

Then, I use the function make_long to reshape the data into the long format needed for plotting (columns x, next_x, node, and next_node).

df <- df %>%
  make_long(first_page, second_page, third_page, fourth_page)
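The data in the question has one row per user per step rather than one column per step, so it needs to be widened before make_long can be applied. A minimal sketch in base R, assuming the column names from the question (UserID, Page_Visited, Path_ID); the toy rows and the step_ prefix are illustrative choices, not part of the original data:

```r
# Toy data in the layout from the question (one row per user per step).
clicks <- data.frame(
  UserID       = c(123, 123, 123, 456, 456),
  Page_Visited = c("Pg1", "Pg2", "Pg3", "Pg1", "Pg5"),
  Path_ID      = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Order by step first so the widened columns come out as step 1, 2, 3, ...
clicks <- clicks[order(clicks$Path_ID, clicks$UserID), ]

# Base R reshape(): one row per user, one column per step.
wide <- reshape(
  clicks,
  idvar     = "UserID",
  timevar   = "Path_ID",
  v.names   = "Page_Visited",
  direction = "wide"
)

# reshape() names the new columns Page_Visited.1, Page_Visited.2, ...;
# shorten them to step_1, step_2, ... (hypothetical names).
names(wide) <- sub("^Page_Visited\\.", "step_", names(wide))
```

The step columns of wide can then be passed to make_long in the same way as first_page, second_page, and so on. Users who leave early get NA in the later steps; how those rows should be drawn is worth checking on the real data.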

Finally, I use ggplot to represent the Sankey diagram.

ggplot(df, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node),
               label = node)) +
  geom_sankey(flow.alpha = 0.5,
              node.color = "black",
              show.legend = FALSE) +
  geom_sankey_label() +
  theme_sankey(base_size = 16)

Here you can see the plot:

[Image: Sankey plot]

At the following link (in Spanish) you can find further information on the package and its application.

https://r-charts.com/es/flujo/diagrama-sankey-ggplot2/
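If ggsankey is not available, the same information can be prepared without it. A Sankey diagram draws transitions between pages, so the per-step frequency table in the question is not enough on its own: it records how many users were on each page at each step, but not which page they came from. A minimal base R sketch that counts transitions directly from the raw clickstream, assuming the column names from the question and that each user's Path_ID values are consecutive with no gaps (the toy rows and the names links and link_counts are illustrative):

```r
# Toy clickstream in the layout from the question.
clicks <- data.frame(
  UserID       = c(123, 123, 123, 456, 456, 789, 789),
  Page_Visited = c("Pg1", "Pg2", "Pg3", "Pg1", "Pg5", "Pg1", "Pg2"),
  Path_ID      = c(1, 2, 3, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Sort so that each user's steps are adjacent and in order.
clicks <- clicks[order(clicks$UserID, clicks$Path_ID), ]
n <- nrow(clicks)

# Row i and row i + 1 form a transition when they belong to the same user.
same_user <- clicks$UserID[-n] == clicks$UserID[-1]

links <- data.frame(
  step   = clicks$Path_ID[-n][same_user],       # step the user is leaving
  source = clicks$Page_Visited[-n][same_user],  # page at that step
  target = clicks$Page_Visited[-1][same_user],  # page at the next step
  value  = 1,
  stringsAsFactors = FALSE
)

# Number of users making each transition at each step.
link_counts <- aggregate(value ~ step + source + target, data = links, FUN = sum)
print(link_counts)
```

A source/target/value table like link_counts is the general shape that Sankey tools expect as input. networkD3's sankeyNetwork(), for example, takes such a links table together with a separate nodes table, and expects the source and target columns to be zero-based indices into that nodes table.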

Please let me know if I can help further. For the future, keep in mind that it is always better to supply a reproducible example of your code.

Cheers!

Pablo

  • Thanks a lot, @pablo_sama. The problem here is that make_long does not work for me. The ggsankey package did not download easily either. Is there an update on this? – user2845095 Sep 14 '22 at 08:13