3

I am trying to count all sequences in a large list of characters delimetered by ">" but only the combinations that are directly next to each other.

e.g. given the character vector:

[1]Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
[2]Referral>Referral>Referral

I can run the following line to retrieve all combinations with of 2 characters:

split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )

Returns:

[[1]]

 [1] "Social>PaidSearch"           "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
 [6] "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
[11] "PaidSearch>OrganicSearch"    "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"

[[2]]

[1] "Referral>Referral" "Referral>Referral"

Which is all possible 2 character sequences in my data (splits in order)

I know want to have all possible outcomes of 3 characters.

e.g.

"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"

Tried to use

unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))

But it returns all combinations including those that aren't directly following.

I also don't want it to return combinations of the last value in row one with the first value in row 2 etc.

nicola
  • 24,005
  • 3
  • 35
  • 56

3 Answers3

2

Let's start with creating some data:

set.seed(1)

data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), re = T))
data <- sapply(data, paste, collapse = ">")

data
#> [1] "B>B>C>A"           "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"

Given the problem, it makes sense to think of these data as a list of vectors that we get after splitting the elements by the delimiter >:

strsplit(data, ">")
#> [[1]]
#> [1] "B" "B" "C" "A"
#> 
#> [[2]]
#> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
#> 
#> [[3]]
#> [1] "C" "C" "B" "C" "C" "A"

Now, the core of the problem is to find all consecutive sequences of a given length from a single vector. Once we can do that, it's simple to apply over the list of data that we have; transforming back to the delimited format will also be simple.

With that goal in mind, we can then make a function for extracting the sequences; here we just loop over each element and extract all sequences of the given length to a list:

seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

We can now just apply the function accross the data after splitting the delimited characters into vectors to get the result. We also need an additional sapply with paste to transform the data back into the delimited format that we started with:

lapply(strsplit(data, ">"), function(x) {
  sapply(seqs(x, 3), paste, collapse = ">")
})
#> [[1]]
#> [1] "B>B>C" "B>C>A"
#> 
#> [[2]]
#> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Further, to get sequences of multiple lengths at the same time, we can add another layer of iteration:

lapply(strsplit(data, ">"), function(x) {
  unlist(sapply(c(2, 3), function(n) {
    sapply(seqs(x, n), paste, collapse = ">")
  }))
})
#> [[1]]
#> [1] "B>B"   "B>C"   "C>A"   "B>B>C" "B>C>A"
#> 
#> [[2]]
#>  [1] "C>B"   "B>B"   "B>A"   "A>A"   "A>A"   "A>C"   "C>B"   "B>C"  
#>  [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C"   "C>B"   "B>C"   "C>C"   "C>A"   "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Created on 2018-05-21 by the reprex package (v0.2.0).

Mikko Marttila
  • 10,972
  • 18
  • 31
  • That works perfect. I have a further question which is just to make my life a bit easier so I don't have to do any merging later. Is there an easy modification to this code that would allow me to get both sequences with 2 characters and 3 characters. i.e. "B>B", "B>C", "B>B>C" etc. – Jamie Allan May 21 '18 at 06:14
  • Hmm.. Sure: that just requires an additional `sapply` layer in the end. I'll edit accordingly. – Mikko Marttila May 21 '18 at 06:27
  • Thanks so much for your help. Much appreciated. – Jamie Allan May 21 '18 at 06:35
  • @JamieAllan No problem, and welcome to SO! As your problem is solved, you could also [accept an answer](https://stackoverflow.com/help/someone-answers). – Mikko Marttila May 21 '18 at 06:40
  • I have just tried to increase my list size as it won't let me go more than 10. In fact, I have 24,000,000 rows that I eventually need to apply the code on. The code in my original post was able to cycle through all 24M rows and returned a table tallying all sequences. There is a definitive number of sequences depending on the order of the model I am using. In the first case (2 chars) there is 81 possible combinations. In the second case (2-3 chars) there is 810 possible combinations. Exponentially increasing with the higher order. I assumed this code would work but I guess not on a large scale – Jamie Allan May 21 '18 at 06:57
  • Returns Error: _Error in seq_len(length(x) - k) : argument must be coercible to non-negative integer_ – Jamie Allan May 21 '18 at 07:01
  • That happens because there are data that are shorter than the sequence length that you are looking for; that case needs to be handled separately in the `seqs` function: see the edited version. – Mikko Marttila May 21 '18 at 07:40
  • Ah...Yes I just realized I had rows with only 1 character so there was no delimiter etc. Thanks again, I will accept the answer now – Jamie Allan May 21 '18 at 21:14
1

Using the stringr package (or regex in general).

library(stringr)
str_extract_all(p, "(\\w+)>(\\w+)>(\\w+)")

With overlap, but the code could be simplified.

str_extract_all_overlap <- function (x) {
  extractions <- character()
  x_curr <- x
  extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
  i = 1
  while (!is.na(extr)) {
    extractions[i] <- extr 
    x_curr <- str_replace(x_curr, "\\w+", replacement = "")
    extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
    i = i + 1
  }
  return(extractions)
}

lapply(p, str_extract_all_overlap)
Nic
  • 363
  • 2
  • 8
  • 1
    I think there should be 12 extracted chunks when accounting for overlaps. This only gets the 4 discrete parts with no overlap. – thelatemail May 21 '18 at 05:44
  • This is a good start. However, thelatemail is correct, this only returns 5 discrete parts, 4 for the first character and 1 for the 2nd. I need to have the overlapping occurances too. – Jamie Allan May 21 '18 at 05:50
  • Addressed in the edit, but there is probably a simpler solution with fewer lines of code. – Nic May 21 '18 at 06:29
  • Thanks for your input – Jamie Allan May 21 '18 at 06:43
0

You could also adapt the paste-command in your second sapply to:

paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2) , sep = ">")

Your full code should now look like:

split_fn <- sapply(p , strsplit , split = ">", USE.NAMES = FALSE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2), sep = ">") )

The result:

> split_fn
[[1]]
 [1] "Social>PaidSearch>PaidSearch"              "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [4] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [7] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
[10] "PaidSearch>PaidSearch>OrganicSearch"       "PaidSearch>OrganicSearch>OrganicSearch"    "OrganicSearch>OrganicSearch>OrganicSearch"

[[2]]
[1] "Referral>Referral>Referral"
h3rm4n
  • 4,126
  • 15
  • 21
  • Nice. The head(tail(x,-1),-1) line is what I was initially looking for. I was trying to modify head and tail seperately but didn't think of that. However, the solution above is a much cleaner way and gets extra information which I was going to find by merging this with my previous results. Thanks for the input – Jamie Allan May 21 '18 at 21:28