I am trying to build a Sankey chart to visualize the customer journey on a website. My data has two fields: Session_ID and Page_Name. I capped page depth at a maximum of 6 pages per session.

I was able to create the nodes, but not the links. Links have to be of the form (source, target, frequency). Below is my data structure:

test_data = data.frame(session = rep(1:4, each = 4),
                       page = c("a","b","c","d", "a","c","d","e","a","b","d","c","a","d","e","f"))

This should be the final data:

a,b,2
b,c,1
c,d,2
a,c,1
d,e,2
b,d,1
d,c,1
a,d,1
e,f,1
CJ Yetman
user3252148
  • Is this supposed to be grouping within sessions? I have trouble getting to the expected output with your example data, e.g. session 1 is all 'a'. Did you want `session = rep(1:4, each = 4)`? – Marius May 17 '19 at 05:21
  • Sorry, yes, you are right. It's each = 4 – user3252148 May 17 '19 at 05:24

1 Answer

You can do this with dplyr. Since the pages within each session are listed in the order they were visited, you can use lead() to get the next page:

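library(dplyr)

test_data %>%
    group_by(session) %>%
    # lead() returns the next page visited within the same session
    mutate(next_page = lead(page)) %>%
    ungroup() %>%
    # n = how many times each page -> next_page transition occurs
    count(page, next_page) %>%
    # the last page of each session has no next page, so drop those rows
    filter(!is.na(next_page))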
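
If you would rather avoid the dplyr dependency, here is a rough base R sketch of the same idea: pair every page with the page that follows it inside each session, then count the pairs. The column names source, target and frequency follow the wording of the question; source_id and target_id are hypothetical names I made up, and the zero-based indexing step is only an assumption about what your Sankey library wants (networkD3::sankeyNetwork, for one, expects zero-based node indices).

```r
# Same toy data as in the question (with each = 4, per the comments).
test_data <- data.frame(
  session = rep(1:4, each = 4),
  page = c("a", "b", "c", "d", "a", "c", "d", "e",
           "a", "b", "d", "c", "a", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# For each session, pair every page with the page visited right after it.
# head(p, -1) drops the last page, tail(p, -1) drops the first one.
pairs <- do.call(rbind, lapply(
  split(test_data$page, test_data$session),
  function(p) {
    data.frame(source = head(p, -1), target = tail(p, -1),
               stringsAsFactors = FALSE)
  }
))

# Count how often each (source, target) transition occurs.
pairs$frequency <- 1L
links <- aggregate(frequency ~ source + target, data = pairs, FUN = sum)

# Assumption: the plotting library wants integer node indices starting at 0.
# source_id / target_id are hypothetical column names for illustration.
nodes <- data.frame(name = sort(unique(c(links$source, links$target))),
                    stringsAsFactors = FALSE)
links$source_id <- match(links$source, nodes$name) - 1L
links$target_id <- match(links$target, nodes$name) - 1L

print(links)
```

On the sample data this gives nine distinct transitions covering 12 page-to-page steps in total (three per session), with a -> b counted twice. The row order from aggregate() may differ from the order in the question, so sort it if the order matters to you.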
Marius