I am trying to count all sequences in a large character vector whose elements are delimited by ">", but only the combinations of elements that are directly next to each other.
E.g. given the character vector p:
[1] "Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch"
[2] "Referral>Referral>Referral"
I can run the following lines to retrieve all combinations of 2 adjacent elements:
split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)
split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )
This returns:
[[1]]
[1] "Social>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
[6] "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
[11] "PaidSearch>OrganicSearch" "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"
[[2]]
[1] "Referral>Referral" "Referral>Referral"
These are all the adjacent 2-element sequences in my data, in their original order.
I now want all possible sequences of 3 adjacent elements,
e.g.
"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"
I tried to use:
unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))
But it returns all combinations, including those whose elements do not directly follow one another.
I also don't want it to combine the last value in row 1 with the first value in row 2, etc.
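To make the desired result concrete, here is a minimal sketch of the behaviour I am after, using a sliding window over each split string. The helper name `adjacent_ngrams` and its arguments `n` and `sep` are hypothetical names made up for illustration, and the input is a shortened version of my data. Each string is processed separately, so nothing spans two rows.

```r
# Hypothetical helper (not from any package): returns every run of `n`
# adjacent elements within each string of `p`, never crossing strings.
adjacent_ngrams <- function(p, n = 3, sep = ">") {
  lapply(strsplit(p, split = sep, fixed = TRUE), function(x) {
    # Strings with fewer than n elements contribute no sequences
    if (length(x) < n) return(character(0))
    # One window per valid start position
    starts <- seq_len(length(x) - n + 1)
    vapply(starts,
           function(i) paste(x[i:(i + n - 1)], collapse = sep),
           character(1))
  })
}

# Shortened example input (assumption: same shape as the real data)
p <- c("Social>PaidSearch>PaidSearch>OrganicSearch",
       "Referral>Referral>Referral")

res <- adjacent_ngrams(p, n = 3)

# Counts of each adjacent 3-element sequence across all rows
counts <- table(unlist(res))
```

With `n = 2` this should give the same pairs as the head/tail approach above, and `counts` holds the frequency of each sequence.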