0

Is there an easy way to get the maximum number of consecutive 1's in a string like "000010011100011111001111111100"?

I can certainly do it with loops, but I'd like to avoid that since my actual dataset has about 500,000 records.

Thanks for your help in advance.

Thomas
Sam
  • What have you tried (and other questions from the [Stack Overflow question checklist](http://meta.stackexchange.com/questions/156810/stack-overflow-question-checklist))? – Joshua Ulrich Aug 01 '13 at 21:01
  • I only tried using loops. I have two loops: one is a counter on the row number that starts at the first row of the dataset and goes all the way to the end, and the other is a counter of the number of consecutive 1's. But it's very inefficient and takes a long time to run. – Sam Aug 01 '13 at 21:04
  • @Thomas, you are right. I searched but I didn't find anything. I should've used better keywords to search. – Sam Aug 01 '13 at 21:13

3 Answers

7

Using rle is slower and somewhat clumsier than using regular expressions. Thomas' answer also still leaves you to extract the maximum length among the runs whose value is 1.

# make some data
set.seed(21)
N <- 1e5
s <- sample(c("0","1"), N*30, TRUE)
s <- split(s, rep(1:N, each=30))
s <- sapply(s, paste, collapse="")
# Thomas' (complete) answer
r <- function(S) {
  sapply(S, function(x) {
    rl <- rle(as.numeric(strsplit(x,"")[[1]]))
    max(rl$lengths[rl$values==1])
  })
}
# using regular expressions
g <- function(S) sapply(gregexpr("1*",S),
   function(x) max(attr(x,'match.length')))
# timing
system.time(R <- r(s))
#    user  system elapsed 
#    6.41    0.00    6.41
system.time(G <- g(s))
#    user  system elapsed 
#    1.47    0.00    1.46
all.equal(R,G)
# [1] "names for target but not for current"
# (the values agree; only the names attribute differs)
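To see what the regular-expression approach extracts, here is a small sketch on the sample string from the question. It assumes the pattern "1+" rather than the "1*" used in g above, so that only the non-empty runs are listed. With "1+", a string containing no 1's yields a match length of -1 rather than 0, so that case would need separate handling.

```r
# Sample string from the question
x <- "000010011100011111001111111100"

# gregexpr() returns one list element per input string; its "match.length"
# attribute holds the length of each match, i.e. of each run of 1's
m <- gregexpr("1+", x)[[1]]
run_lengths <- attr(m, "match.length")
run_lengths
# [1] 1 3 5 8

# The longest run of 1's
max(run_lengths)
# [1] 8
```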
Joshua Ulrich
6

A much faster alternative that avoids rle is to split the string on runs of any character other than 1 (here, the consecutive 0's):

# following thelatemail's comment, changed '0+' to '[^1]+'
strsplit(x, "[^1]+", perl=TRUE)

You can then loop over the resulting list and take the maximum number of characters in each element. This is faster than the rle solution, and also faster than the gregexpr solution from @Joshua. Some benchmarking:

zz <- function(x) {
    vapply(strsplit(x, "[^1]+", perl=TRUE), function(x) max(nchar(x)), 0L)
}
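As a quick illustration of what zz computes for a single string, the following sketch runs the same split on the sample string from the question (the variable name pieces is only for illustration):

```r
# Sample string from the question
x <- "000010011100011111001111111100"

# Split on runs of any character other than "1"; the pieces that remain are
# the runs of 1's, plus an empty first piece because the string starts
# with a separator
pieces <- strsplit(x, "[^1]+", perl = TRUE)[[1]]
pieces
# elements: "", "1", "111", "11111", "11111111"

# The longest piece is the longest run of 1's
max(nchar(pieces))
# [1] 8
```

The empty leading piece is harmless here because it has zero characters and so never wins the max.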

I just realised that @Joshua's function could also be tweaked by adding perl=TRUE and using vapply. So, I'll compare that as well.

g2 <- function(S) vapply(gregexpr("1*",S, perl=TRUE),
   function(x) max(attr(x,'match.length')), 0L)

require(microbenchmark)
microbenchmark(t1 <- zz(unname(s)), t2 <- g(unname(s)), t3 <- g2(unname(s)), times=50)
# Unit: seconds
#                 expr      min       lq   median       uq      max neval
#  t1 <- zz(unname(s)) 1.187197 1.285065 1.344371 1.497564 1.565481    50
#   t2 <- g(unname(s)) 2.154038 2.307953 2.357789 2.417259 2.596787    50
#  t3 <- g2(unname(s)) 1.562661 1.854143 1.914597 1.954795 2.203543    50

identical(t1, t2) # [1] TRUE
identical(t1, t3) # [1] TRUE
Arun
4

Use rle:

x <- "000010011100011111001111111100"
rr <- rle(strsplit(x,"")[[1]])

rr
# Run Length Encoding
#   lengths: int [1:9] 4 1 2 3 3 5 2 8 2
#   values : chr [1:9] "0" "1" "0" "1" "0" "1" "0" "1" "0"

Note: I removed the as.numeric part as it's not necessary. From here, you can get the maximum count of consecutive 1's with:

max(rr$lengths[which(rr$values == "1")])
# [1] 8
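The code above handles a single string. To apply it to a whole character vector, such as a data frame column, the rle step has to run once per element. Below is a minimal sketch of that. The helper max_run_of_ones, the data frame df and its column names are hypothetical and not part of the original answer, and returning 0 for strings without any 1's is an assumption about the desired behaviour.

```r
# Hypothetical helper: longest run of "1" in each element of a character vector
max_run_of_ones <- function(strings) {
  vapply(strsplit(strings, ""), function(chars) {
    rr <- rle(chars)
    ones <- rr$lengths[rr$values == "1"]
    # Assumption: return 0 when a string contains no 1's,
    # instead of letting max() warn and return -Inf
    if (length(ones) == 0L) 0L else max(ones)
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical data frame with one string per row
df <- data.frame(
  bits = c("000010011100011111001111111100", "0000", "1101"),
  stringsAsFactors = FALSE
)
df$max_ones <- max_run_of_ones(df$bits)
df$max_ones
# [1] 8 0 2
```

Passing the whole column at once matters: indexing only the first element of the strsplit result (the `[[1]]` in the single-string version) would give every row the same value.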
Arun
Thomas
  • @Arun - I think that should be a separate answer rather than an edit. If you do so, I can probably delete mine then. – thelatemail Aug 01 '13 at 23:00
  • @thelatemail, Yes, I realise that now. posted separately. Thanks. (Thomas, sorry for the mess). – Arun Aug 01 '13 at 23:02
  • How can I do this if I want to create a separate column? I tried doing it for a column and I am getting the same value for all rows. Any suggestions? – user3570187 Apr 16 '20 at 11:09