I am working on a clickstream dataset containing users, the pages they visited, and a path number giving the order of each visit (1 = starting page, 2 = the next page visited, etc.). I am trying to visualize user paths, and I thought a Sankey diagram would be the best fit, but I am at a loss as to how to convert the dataset into one. Below is what my dataset looks like:
UserID | Page_Visited | Path_ID |
---|---|---|
123 | Pg1 | 1 |
123 | Pg2 | 2 |
123 | Pg3 | 3 |
456 | Pg1 | 1 |
456 | Pg5 | 2 |
All I want to show is the cumulative path: x users started at Pg1, then went on to Pg2, Pg5, or another page, and so on for each later step. Something like a Sankey diagram:
1 > 2 > 3 > ...
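To make the target shape concrete, here is a minimal sketch (in Python, standard library only) of the restructuring this seems to need: turning the per-visit rows into source/target/value links, which, as far as I understand, is the kind of links input that Sankey tools such as networkD3's `sankeyNetwork` take. The `rows` list just copies the sample table above, and `build_links` is a hypothetical helper name, not part of any library. Prefixing each page with its step number is an assumption made so that the same page at different steps becomes a separate node.

```python
from collections import Counter, defaultdict

# Sample rows copied from the table above: (UserID, Page_Visited, Path_ID)
rows = [
    (123, "Pg1", 1),
    (123, "Pg2", 2),
    (123, "Pg3", 3),
    (456, "Pg1", 1),
    (456, "Pg5", 2),
]

def build_links(rows):
    """Count users moving from each (step, page) to their next (step, page)."""
    # Group every user's visits together.
    paths = defaultdict(list)
    for user, page, step in rows:
        paths[user].append((step, page))

    links = Counter()
    for visits in paths.values():
        visits.sort()  # order each user's visits by Path_ID
        # Pair each visit with the one that follows it.
        for (s1, p1), (s2, p2) in zip(visits, visits[1:]):
            links[(f"{s1}:{p1}", f"{s2}:{p2}")] += 1
    return links

links = build_links(rows)
for (source, target), value in sorted(links.items()):
    print(source, "->", target, value)
```

Each printed line is one band of the diagram: the two endpoints and the number of users who made that move.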
I created a frequency dataset that looks like this, where id = path number (1 = the starting page), Page = the page visited, and freq = the number of users on that page at that path number:
id | Page | freq |
---|---|---|
1 | create_message | 1 |
1 | home | 153 |
1 | about | 97 |
2 | create_message | 10 |
2 | pricing | 21 |
2 | home | 155 |
2 | contact | 2 |
2 | services | 31 |
3 | home | 22 |
3 | pricing | 44 |
3 | about | 11 |
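For reference, a frequency table of this shape can be computed roughly as in the sketch below. It runs on the small sample rows from the first table, not on the real data behind the counts above, and the variable names are only illustrative. Note that this aggregation keeps only per-step counts, so on its own it does not record which page led to which.

```python
from collections import Counter

# Sample rows from the first table: (UserID, Page_Visited, Path_ID)
rows = [
    (123, "Pg1", 1),
    (123, "Pg2", 2),
    (123, "Pg3", 3),
    (456, "Pg1", 1),
    (456, "Pg5", 2),
]

# Count how many visits landed on each page at each path number.
freq = Counter((step, page) for _user, page, step in rows)

# Print in the same id | Page | freq layout as the table above.
for (step, page), count in sorted(freq.items()):
    print(step, "|", page, "|", count)
```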
I would really appreciate some help here. How should I restructure my data, or what code could get me going? (I tried the networkD3 package, but I think I am using it incorrectly.) If you think I should be using a different visualization instead of a Sankey diagram, I am open to trying that too. Thank you.