Data Visualization of User Path in R

Question

I am working on a clickstream dataset, where I have users, the pages they visited, and the path number (1 = starting page, 2 = next page they visited, etc). I am trying to visualize user paths. I thought a Sankey diagram would be best. But I am at a loss on how to convert the dataset into a Sankey diagram. Below is what my dataset looks like:

UserID	Page_Visited	Path_ID
123	Pg1	1
123	Pg2	2
123	Pg3	3
456	Pg1	1
456	Pg5	2

All I want to show is the cumulative path: x number of users started at Pg1, then went to Pg2, Pg5, or another page. Something like a Sankey diagram.

1 > 2 > 3 > ...

I created a frequency dataset that looks like this where id = path number (1 = the starting page), Page = page_visited, and freq = count of users on that page at that path number:

id	Page	freq
1	create_message	1
1	home	153
1	about	97
2	create_message	10
2	pricing	21
2	home	155
2	contact	2
2	services	31
3	home	22
3	pricing	44
3	about	11

I would really appreciate some help here. How do I restructure my data or what code could get me going (I tried networkD3 package, but I think I am using it incorrectly)? Any help is much appreciated. If you think I should be using a different visualization and not Sankey, I am open to trying that too. Thank you.

please add a **reproducible** example of your data, using `dput` for example — gaut, Aug 02 '22 at 08:13
Update: the ggalluvial package sort of got what I am trying to do. But I am not there yet. — user2845095, Aug 02 '22 at 08:16

score 1 · Answer 1 · 2022-09-12T19:59:10.713

I think that the package "ggsankey" can be very helpful in your case.
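
A note on installation: as far as I know, ggsankey is not distributed on CRAN, so install.packages("ggsankey") will not find it. Below is a minimal sketch of the usual GitHub route, assuming the repository is davidsjoberg/ggsankey and that the remotes package is acceptable on your machine. The helper name install_ggsankey is hypothetical and only used for illustration.

```r
# Assumption: ggsankey lives on GitHub at davidsjoberg/ggsankey, not on CRAN.
# install_ggsankey is a hypothetical helper name, not part of any package.
install_ggsankey <- function() {
  # remotes provides install_github(); install it first if it is missing
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("davidsjoberg/ggsankey")
}

# Run once, and only if the package is not already available:
# if (!requireNamespace("ggsankey", quietly = TRUE)) install_ggsankey()
```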

In the following code, I simulate a dataset where columns represent the order of the pages visited (from 1st page to 4th page) and each observation represents the pages visited by an individual (here, I simulate 10 individuals).

library(ggsankey)
library(ggplot2)
library(dplyr)

# Simulate 10 users; each column holds the page visited at that step
set.seed(1)  # fix the seed so the simulated data is reproducible
df <- data.frame("id" = 1:10,
                 "first_page" = sample(x = c("home"), size = 10, replace = TRUE),
                 "second_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE),
                 "third_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE),
                 "fourth_page" = sample(x = c("create_message", "pricing", "services"), size = 10, replace = TRUE)
)

Then, I use the function make_long to reshape the data into the long format that ggsankey needs for plotting.

df <- df %>%
  make_long(first_page, second_page, third_page, fourth_page)
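
The simulated data above is wide (one column per step), whereas the clickstream in the question is long (one row per user and step). Here is a minimal sketch of getting from one layout to the other with base R's reshape(), using the five rows from the question. The column names Page_Visited.1, Page_Visited.2, ... are generated by reshape(). How make_long and geom_sankey treat the NA steps of users who left early is worth checking on real data.

```r
# The five example rows from the question
clicks <- data.frame(
  UserID       = c(123, 123, 123, 456, 456),
  Page_Visited = c("Pg1", "Pg2", "Pg3", "Pg1", "Pg5"),
  Path_ID      = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# One row per user, one column per step; steps a user never reached become NA
wide <- reshape(clicks, idvar = "UserID", timevar = "Path_ID", direction = "wide")
rownames(wide) <- NULL
print(wide)
```

The result could then be passed on in the same way as the simulated data, for example wide %>% make_long(Page_Visited.1, Page_Visited.2, Page_Visited.3).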

Finally, I use ggplot to represent the Sankey diagram.

ggplot(df, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node),
               label = node)) +
  geom_sankey(flow.alpha = 0.5,
              node.color = "black",
              show.legend = FALSE) +
  geom_sankey_label() +
  theme_sankey(base_size = 16)

Here you can see the plot:

Sankey plot
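
To get the numbers behind such a diagram (x users started at Pg1, then went to Pg2 or Pg5), the flows can also be tabulated directly. Note that a per-step frequency table like the one in the question records how many users were on each page at each step, but not which page led to which, so the links have to be derived from the raw per-user rows. The following is a minimal base R sketch, assuming one row per user and step; the column names Next_Page and n_users are made up for this illustration.

```r
# The five example rows from the question
clicks <- data.frame(
  UserID       = c(123, 123, 123, 456, 456),
  Page_Visited = c("Pg1", "Pg2", "Pg3", "Pg1", "Pg5"),
  Path_ID      = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Sort so that each user's visits are in step order
clicks <- clicks[order(clicks$UserID, clicks$Path_ID), ]

# Next page within the same user; NA at a user's last step
n <- nrow(clicks)
same_user <- c(clicks$UserID[-1] == clicks$UserID[-n], FALSE)
clicks$Next_Page <- ifelse(same_user, c(clicks$Page_Visited[-1], NA), NA)

# Count rows per (step, page, next page); with one row per user and step
# (an assumption) this equals the number of users taking that transition
links <- aggregate(
  UserID ~ Path_ID + Page_Visited + Next_Page,
  data = clicks[!is.na(clicks$Next_Page), ],
  FUN  = length
)
names(links)[names(links) == "UserID"] <- "n_users"
links <- links[order(links$Path_ID, links$Page_Visited, links$Next_Page), ]
print(links)
```

A table of this shape (source, target, count) is also what link-based Sankey tools generally expect as input.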

In the following link you can find further information on the package and its application.

https://r-charts.com/es/flujo/diagrama-sankey-ggplot2/

Please, let me know if I can further help you. For the future, keep in mind that it is always better to supply a reproducible example of your code.

Cheers!

Pablo

Thanks a lot, @pablo_sama. The problem here is that make_long does not work for me. The ggsankey package did not download easily either. Is there an update on this? — user2845095, Sep 14 '22 at 08:13
