Is it possible to count a repeating part of a sequence in R? For example:

x<- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
      3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)

Is it possible to count the number of times the subsequence 3.0,3.1,3.2 occurs? In this example the answer should be 4.

Hong Ooi
  • Do you just want to count that particular subsequence? Or do you want to identify any other subsequences that might be in your data? – A5C1D2H2I1M1N2O1R2T1 Jun 28 '13 at 14:13
  • Insert standard warning about matching floating-point values. Unless you need to keep everything numeric, you may want to run your data through `sprintf("%2f",mydata)` or equivalent so you can do exact matches on strings. – Carl Witthoft Jun 28 '13 at 14:34
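
The floating-point warning in the comment above can be made concrete. The sketch below first shows the classic case where two doubles that print the same compare unequal, then counts windows that match a pattern within a tolerance. The helper name `count_near` and the default `tol` are assumptions made for illustration; they do not come from the thread.

```r
# Two values that "should" be equal differ in their last bits.
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Hypothetical helper: count windows of x that match pattern within tol.
count_near <- function(x, pattern, tol = 1e-8) {
  k <- length(pattern)
  if (length(x) < k) return(0L)
  # embed() returns each window reversed, so flip the columns back.
  windows <- embed(x, k)[, k:1, drop = FALSE]
  # Subtract the pattern from every row, then require all k entries to be close.
  sum(rowSums(abs(sweep(windows, 2, pattern)) < tol) == k)
}

y <- c(1, 0.1 + 0.2, 0.4)
count_near(y, c(0.3, 0.4))   # 1, although y[2] == 0.3 is FALSE
```

Whether a tolerance or a string conversion is the better fit depends on where the numbers come from: values typed in as literals compare exactly, values produced by arithmetic often do not.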

4 Answers

I'd do something like this:

pattern <- c(3, 3.1, 3.2)
len1 <- seq_len(length(x) - length(pattern) + 1)
len2 <- seq_len(length(pattern))-1
sum(colSums(matrix(x[outer(len1, len2, '+')], 
     ncol=length(len1), byrow=TRUE) == pattern) == length(len2))

PS: if you replace the outer sum with which, you get the start index of each match instead of the count.
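
As a sketch of that PS, here is the same computation split into named steps, with `which` in place of `sum`. The intermediate names `windows` and `starts` are introduced here for readability only. Note that this form assumes `x` is at least as long as `pattern`; otherwise `seq_len` receives a negative argument and fails.

```r
x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
pattern <- c(3, 3.1, 3.2)

len1 <- seq_len(length(x) - length(pattern) + 1)  # candidate start positions
len2 <- seq_len(length(pattern)) - 1              # offsets within a window

# One column per window: column j holds x[j], x[j + 1], x[j + 2].
windows <- matrix(x[outer(len1, len2, "+")],
                  ncol = length(len1), byrow = TRUE)

# A window matches when every element agrees with the pattern.
starts <- which(colSums(windows == pattern) == length(pattern))
starts
# [1]  2  8 16 26
length(starts)
# [1] 4
```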

Arun

One more (generic moving window) approach:

x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2, 3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
s <- c(3, 3.1, 3.2)

sum(apply(embed(x, length(s)), 1, function(y) {all(y == rev(s))}))
# [1] 4

See output of embed to understand what's happening.
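
For readers who do not have R at hand, a small illustration of what embed returns (the variable name `m` is just for this example). Each row is one window of the input, written newest element first, which is why the comparison above uses rev(s).

```r
m <- embed(1:5, 3)
m
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3

# Row i is the window x[i], x[i + 1], x[i + 2] in reverse order.
rev(m[1, ])
# [1] 1 2 3
```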

As Arun points out, apply is pretty slow here; combining embed with Arun's matrix trick makes this a lot faster:

sum(colSums(matrix(embed(x, length(s)),
                   byrow = TRUE, nrow = length(s)) == rev(s)) == length(s))
eddi
  • I ran across `embed` at first as well. But a vector scan required taking transpose. Or one should be using `apply`. I reverted therefore to constructing the matrix row-wise. – Arun Jun 28 '13 at 15:28
  • makes sense, I just tested and this is slightly faster than your `outer` approach when one gets rid of `apply` and does your `matrix` thing; I'll edit that approach in as well – eddi Jun 28 '13 at 15:32

You could turn it into a string, and use gregexpr.

sum(gregexpr("3 3.1 3.2", paste(x, collapse=" "), fixed=TRUE)[[1]] != -1)
# [1] 4
Hong Ooi
  • This'll give an answer of 1 when there's no match, because `gregexpr` returns -1 in case of no match. – Arun Jun 28 '13 at 14:28
  • this gives incorrect results for overlapping sequences: `x = c(1,2,2,2,3,2,2); s = c(2,2)` – eddi Jun 28 '13 at 15:16
  • @eddi It's a bit silly to say it's "incorrect" when you actually have no idea what the OP wants to do with overlapping sequences, or in fact, if overlapping sequences need to be considered at all. – Hong Ooi Jun 28 '13 at 15:27
  • @HongOoi I think you should attempt to fix instead of getting defensive, the fact that this does smth else for overlapping sequences than the other solutions should at least give you a pause – eddi Jun 28 '13 at 15:40
  • @eddi Fixing something assumes there is something to fix. As I said, you don't actually know what is the desired output when sequences can overlap, or even if sequences can overlap. You have _assumed_ that they can occur, and you have _assumed_ that a particular output in that case is the correct one. Neither of these assumptions can be proved true until/unless the OP clarifies. – Hong Ooi Jun 28 '13 at 15:43
  • @HongOoi *shrug* extending that logic `print(4)` is also a solution to OP; just because it's an input *you* haven't anticipated, doesn't mean your response when someone points it out should be "that particular input is not in the OP"; anyway, if you don't want to fix it - that's fine obviously, hopefully my comment will be useful for whomever reads this answer in the future – eddi Jun 28 '13 at 15:50
  • @eddi Indeed, "4" is an answer to the question "count the times that the subsequence 3.0,3.1,3.2 occurs", in the specific sequence given. Because we are humans and not computers, however, we can understand from context that the specific numbers given are to be treated as placeholders for more general expressions. However, the specification of the problem as stated remains ambiguous for certain edge cases. I simply say that calling particular interpretations "incorrect" in the existence of ambiguity is... silly. – Hong Ooi Jun 28 '13 at 15:55
  • If we wanted to be utterly pedantic, in fact, the REAL answer is `TRUE`: it _is_ "possible to count the times...". I guess we've all been barking up the wrong tree.... – Hong Ooi Jun 28 '13 at 16:02
  • @HongOoi FWIW If you want to catch overlapping patterns, you could wrap the pattern in a lookahead assertion, and set `fixed=FALSE, perl=TRUE`. – Matthew Plourde Jun 28 '13 at 16:20
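
A minimal sketch of the lookahead suggestion, using eddi's overlapping example. Two details here are additions of this sketch rather than part of the thread: the strings are padded with spaces so that a token such as 2 cannot match inside 12, and the pattern is wrapped in `\Q...\E` so that the decimal points are treated literally once `fixed=TRUE` is dropped. The names `hay`, `needle` and `count_hits` are hypothetical.

```r
x2 <- c(1, 2, 2, 2, 3, 2, 2)
s2 <- c(2, 2)

# Pad both strings with spaces so that only whole tokens can match.
hay    <- paste0(" ", paste(x2, collapse = " "), " ")
needle <- paste0(" ", paste(s2, collapse = " "), " ")

# gregexpr() reports -1 when nothing matches, so count positions other than -1.
count_hits <- function(m) sum(m[[1]] != -1)

# Non-overlapping matches: the search resumes after each full match.
count_hits(gregexpr(needle, hay, fixed = TRUE))
# [1] 2

# Overlapping matches: a zero-width lookahead consumes no characters.
count_hits(gregexpr(paste0("(?=\\Q", needle, "\\E)"), hay, perl = TRUE))
# [1] 3
```

Which of the two counts is wanted depends on how overlapping occurrences should be treated, which the question leaves open.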

Carl Witthoft's seqle function might be useful for you here.

The function looks like this:

seqle <- function(x,incr=1) { 
    if(!is.numeric(x)) x <- as.numeric(x) 
    n <- length(x)  
    y <- x[-1L] != x[-n] + incr 
    i <- c(which(y|is.na(y)),n) 
    list(lengths = diff(c(0L,i)),
         values = x[head(c(0L,i)+1L,-1L)]) 
}

Applied to your data, it should look like this:

temp <- seqle(x, incr=.1)
temp
# $lengths
#  [1] 1 3 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1
# 
# $values
#  [1] 1.0 3.0 1.0 1.0 2.0 3.0 4.0 4.0 5.0 6.0 5.0 3.0 3.1 2.0 1.0 4.0
# [17] 6.0 4.0 4.0 3.0 5.0 3.2 3.0 4.0

Now, how do we read that? lengths tells us that our vector starts with a run of length 1, then a run of length 3, then three runs of length 1, then another run of length 3, and so on. values gives the first value of each run: the first run of length 3 starts at "3.0", the next run of length 3 also starts at "3.0", and so on.

This is easier to see as a data.frame.

data.frame(temp)[temp$lengths > 1, ]
#    lengths values
# 2        3      3
# 6        3      3
# 12       3      3
# 20       3      3

In this example, the lengths of all the sequences are the same, and they start at the same value, so we can get your answer just by looking at the number of rows in the resulting data.frame above.
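
When runs of other lengths or other starting values are present, counting rows is no longer enough. One possible refinement, sketched below, is to filter on both the run length and the starting value. It reuses seqle exactly as given above and still relies on exact floating-point comparison (both inside seqle and in the filter), so the earlier warning about matching doubles applies. It also counts only runs that are exactly the pattern, not longer runs that contain it.

```r
# seqle() copied unchanged from the answer so this block runs on its own.
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  list(lengths = diff(c(0L, i)),
       values = x[head(c(0L, i) + 1L, -1L)])
}

x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
temp <- seqle(x, incr = 0.1)

# Count runs of exactly three values that begin at 3.
sum(temp$lengths == 3 & temp$values == 3)
# [1] 4
```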

A5C1D2H2I1M1N2O1R2T1