How to count a repeating part of a sequence in R?

Question

Is it possible to count a repeating part of a sequence in R? For example:

x<- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
      3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)

Is it possible to count the number of times the subsequence 3.0, 3.1, 3.2 occurs? In this example the answer should be 4.

Do you just want to count that particular subsequence? Or do you want to identify any other subsequences that might be in your data? — A5C1D2H2I1M1N2O1R2T1, Jun 28 '13 at 14:13
Insert standard warning about matching floating-point values. Unless you need to keep everything numeric, you may want to run your data through `sprintf("%2f",mydata)` or equivalent so you can do exact matches on strings. — Carl Witthoft, Jun 28 '13 at 14:34
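To illustrate the floating-point caveat raised in the comment above, here is a minimal sketch. The values 0.1, 0.2 and 0.3 are illustrative only, and the two-decimal format string `"%.2f"` is an assumption about how much precision the data needs.

```r
# Two values that print alike need not be bit-identical.
a <- 0.1 + 0.2
exact_match  <- a == 0.3                     # FALSE
approx_match <- isTRUE(all.equal(a, 0.3))    # TRUE, compares within a tolerance

# Strings formatted to a fixed number of decimals can be matched exactly.
# "%.2f" (two decimals) is an assumed format, chosen for illustration.
string_match <- sprintf("%.2f", a) == sprintf("%.2f", 0.3)   # TRUE
```

The data in the question are typed literals, so `==` happens to work there. Values that come out of arithmetic are where exact comparison tends to fail.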

Arun · Accepted Answer · 2013-06-28T14:48:39.217

5

I'd do something like this:

pattern <- c(3, 3.1, 3.2)
len1 <- seq_len(length(x) - length(pattern) + 1)
len2 <- seq_len(length(pattern))-1
sum(colSums(matrix(x[outer(len1, len2, '+')], 
     ncol=length(len1), byrow=TRUE) == pattern) == length(len2))

PS: Changing `sum` to `which` gives the start index of each match.
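As a sketch of the `which` variant, the same windowing idea can be wrapped in a small function. The name `find_pattern` and the guard for a pattern longer than `x` are additions for illustration, not part of the original answer.

```r
x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)

# find_pattern is a hypothetical helper name.
find_pattern <- function(x, pattern) {
  n_starts <- length(x) - length(pattern) + 1
  if (n_starts < 1) return(integer(0))      # pattern longer than x: no match
  starts  <- seq_len(n_starts)
  offsets <- seq_along(pattern) - 1
  # One column per candidate start; column j holds x[j], x[j + 1], ...
  windows <- matrix(x[outer(starts, offsets, "+")],
                    ncol = n_starts, byrow = TRUE)
  # pattern is recycled down each column, so a full match sums to its length
  which(colSums(windows == pattern) == length(pattern))
}

starts <- find_pattern(x, c(3, 3.1, 3.2))
starts           # 2 8 16 26
length(starts)   # 4
```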

edited Jun 28 '13 at 14:48

answered Jun 28 '13 at 13:58

Arun


eddi · Answer 2 · 2013-06-28T15:33:49.700

3

One more (generic moving window) approach:

x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2, 3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
s <- c(3, 3.1, 3.2)

sum(apply(embed(x, length(s)), 1, function(y) {all(y == rev(s))}))
# [1] 4

See the output of `embed` to understand what's happening.
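For instance, on a short illustrative vector (`1:5` is an assumption chosen for readability), `embed` returns one row per window, with the most recent element first. That reversed order is why the comparison above uses `rev(s)`.

```r
w <- embed(1:5, 3)
w
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3
```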

As Arun points out, `apply` is fairly slow here; combining `embed` with Arun's matrix trick makes this a lot faster:

sum(colSums(matrix(embed(x, length(s)),
                   byrow = TRUE, nrow = length(s)) == rev(s)) == length(s))

edited Jun 28 '13 at 15:33

answered Jun 28 '13 at 15:08

eddi


I ran across `embed` at first as well. But a vector scan required taking transpose. Or one should be using `apply`. I reverted therefore to constructing the matrix row-wise. – Arun Jun 28 '13 at 15:28
makes sense, I just tested and this is slightly faster than your `outer` approach when one gets rid of `apply` and does your `matrix` thing; I'll edit that approach in as well – eddi Jun 28 '13 at 15:32

Hong Ooi · Answer 3 · 2013-06-28T14:33:05.007

2

You could turn it into a string, and use gregexpr.

sum(gregexpr("3 3.1 3.2", paste(x, collapse=" "), fixed=TRUE)[[1]] != -1)
# [1] 4

edited Jun 28 '13 at 14:33

answered Jun 28 '13 at 13:50

Hong Ooi


This'll give an answer of 1 when there's no match, because `gregexpr` returns -1 in case of no match. – Arun Jun 28 '13 at 14:28
this gives incorrect results for overlapping sequences: `x = c(1,2,2,2,3,2,2); s = c(2,2)` – eddi Jun 28 '13 at 15:16
@eddi It's a bit silly to say it's "incorrect" when you actually have no idea what the OP wants to do with overlapping sequences, or in fact, if overlapping sequences need to be considered at all. – Hong Ooi Jun 28 '13 at 15:27
@HongOoi I think you should attempt to fix instead of getting defensive, the fact that this does smth else for overlapping sequences than the other solutions should at least give you a pause – eddi Jun 28 '13 at 15:40
@eddi Fixing something assumes there is something to fix. As I said, you don't actually know what is the desired output when sequences can overlap, or even if sequences can overlap. You have _assumed_ that they can occur, and you have _assumed_ that a particular output in that case is the correct one. Neither of these assumptions can be proved true until/unless the OP clarifies. – Hong Ooi Jun 28 '13 at 15:43
@HongOoi *shrug* extending that logic `print(4)` is also a solution to OP; just because it's an input *you* haven't anticipated, doesn't mean your response when someone points it out should be "that particular input is not in the OP"; anyway, if you don't want to fix it - that's fine obviously, hopefully my comment will be useful for whomever reads this answer in the future – eddi Jun 28 '13 at 15:50
@eddi Indeed, "4" is an answer to the question "count the times that the subsequence 3.0,3.1,3.2 occurs", in the specific sequence given. Because we are humans and not computers, however, we can understand from context that the specific numbers given are to be treated as placeholders for more general expressions. However, the specification of the problem as stated remains ambiguous for certain edge cases. I simply say that calling particular interpretations "incorrect" in the existence of ambiguity is... silly. – Hong Ooi Jun 28 '13 at 15:55
If we wanted to be utterly pedantic, in fact, the REAL answer is `TRUE`: it _is_ "possible to count the times...". I guess we've all been barking up the wrong tree.... – Hong Ooi Jun 28 '13 at 16:02
@HongOoi FWIW If you want to catch overlapping patterns, you could wrap the pattern in a lookahead assertion, and set `fixed=FALSE, perl=TRUE`. – Matthew Plourde Jun 28 '13 at 16:20
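A minimal sketch of the lookahead idea from the comment above, using eddi's overlapping example. The helper name `count_overlapping`, the space padding, and the dot escaping are assumptions added for illustration. It also assumes the values print as plain decimals: a value formatted like `1e+05` would put a regex metacharacter into the pattern.

```r
x <- c(1, 2, 2, 2, 3, 2, 2)
s <- c(2, 2)

# count_overlapping is a hypothetical helper name, not part of any package.
count_overlapping <- function(x, s) {
  # Pad with spaces so that every element is delimited on both sides.
  text <- paste0(" ", paste(x, collapse = " "), " ")
  # Make "." literal in the regex; assumes plain decimal formatting.
  key <- gsub(".", "[.]", paste(s, collapse = " "), fixed = TRUE)
  # A lookahead matches without consuming text, so matches may overlap.
  hits <- gregexpr(paste0("(?= ", key, " )"), text, perl = TRUE)[[1]]
  sum(hits != -1)
}

n_overlap <- count_overlapping(x, s)    # 3: starts at elements 2, 3 and 6
n_fixed   <- sum(gregexpr("2 2", paste(x, collapse = " "),
                          fixed = TRUE)[[1]] != -1)   # 2: no overlaps counted
```

Which of the two counts is wanted depends on how overlapping occurrences should be treated, which the question leaves open.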

score 2 · Answer 4 · edited May 23 '17 at 11:50

2

Carl Witthoft's seqle function might be useful for you here.

The function looks like this:

seqle <- function(x,incr=1) { 
    if(!is.numeric(x)) x <- as.numeric(x) 
    n <- length(x)  
    y <- x[-1L] != x[-n] + incr 
    i <- c(which(y|is.na(y)),n) 
    list(lengths = diff(c(0L,i)),
         values = x[head(c(0L,i)+1L,-1L)]) 
}

Applied to your data, it should look like this:

temp <- seqle(x, incr=.1)
temp
# $lengths
#  [1] 1 3 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1
# 
# $values
#  [1] 1.0 3.0 1.0 1.0 2.0 3.0 4.0 4.0 5.0 6.0 5.0 3.0 3.1 2.0 1.0 4.0
# [17] 6.0 4.0 4.0 3.0 5.0 3.2 3.0 4.0

Now, how do we read that? `lengths` tells us that the vector starts with a run of length 1, then a run of length 3, then three runs of length 1, then another run of length 3, and so on. `values` gives the first value of each run: the first run of length 3 starts at 3.0, the next run of length 3 also starts at 3.0, and so on.

This is easier to see as a data.frame.

data.frame(temp)[temp$lengths > 1, ]
#    lengths values
# 2        3      3
# 6        3      3
# 12       3      3
# 20       3      3

In this example, the lengths of all the sequences are the same, and they start at the same value, so we can get your answer just by looking at the number of rows in the resulting data.frame above.
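If the data contained runs of other lengths or with other starting values, counting rows would no longer be enough. A sketch that filters on both, assuming the target is a run of length 3 starting at 3 with step 0.1 (`seqle` is copied from above so the block runs on its own):

```r
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  list(lengths = diff(c(0L, i)),
       values = x[head(c(0L, i) + 1L, -1L)])
}

x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)

runs <- seqle(x, incr = 0.1)
# Count only the runs with the wanted length and starting value.
n_runs <- sum(runs$lengths == 3 & runs$values == 3)
n_runs   # 4
```

This relies on exact floating-point comparison inside `seqle`, so the caveat from the question's comments applies to other increments and values.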

edited May 23 '17 at 11:50

Community


answered Jun 28 '13 at 14:34

A5C1D2H2I1M1N2O1R2T1


+1, even though I'm not sure if this is what the OP wants. For ex: the pattern could also be: `c(10, 8, 15)`. – Arun Jun 28 '13 at 14:44
@Arun, true. Just throwing it out there! – A5C1D2H2I1M1N2O1R2T1 Jun 28 '13 at 15:26
I have to recuse myself from giving it a +1 for obvious reasons :-). But I am flattered that you'd reference my derivative work. – Carl Witthoft Jun 28 '13 at 15:42
@CarlWitthoft, you're being modest :). The function is plenty useful! Go ahead, give it a +1 :). – Arun Jun 28 '13 at 15:55
@CarlWitthoft, I've done so more than once here on SO! When are you going to package this thing up? – A5C1D2H2I1M1N2O1R2T1 Jun 28 '13 at 16:43
Well to be honest I've been thinking of making a "my toys" package with `seqle` and `approxeq` (returns a vector of T/F rather than the `all.equal` output) and `short` (like a customizable combo of 'head' and 'tail'). Keep encouraging me, or send me Butterfingers bars, and I'll try to get it done :-) – Carl Witthoft Jun 28 '13 at 17:37
@AnandaMahto At long last, I've got the package submitted to CRAN. It's called `cgwtools` ; currently awaiting CRAN approval. – Carl Witthoft Aug 10 '13 at 23:32
@CarlWitthoft Awesome. I'll look out for it! – A5C1D2H2I1M1N2O1R2T1 Aug 11 '13 at 02:16
