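You can take the diff of is.na(x). This will be 1 if and only if the element is TRUE and the previous element is FALSE, i.e. exactly where a run of NAs begins. After applying == 1, you have a logical vector which is TRUE at the start of each NA group. Then you can take the cumsum to identify which NA group you're in, and multiply by is.na(x) to set the non-NAs to 0. Prepending a non-NA value, as in c(1, x), keeps the result the same length as x and lets a leading NA count as a group start.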
x <- c(NA, NA, 1, 2, NA, NA, 3, 4)
cumsum(diff(is.na(c(1, x))) == 1) * is.na(x)
#[1] 1 1 0 0 2 2 0 0
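To see why the non-NA value is prepended, here is a small sketch comparing the padded and unpadded versions on the same example vector. The names padded and unpadded are only for illustration.

```r
x <- c(NA, NA, 1, 2, NA, NA, 3, 4)

# Without padding, diff() returns length(x) - 1 elements and cannot
# see that the vector starts inside an NA run.
unpadded <- diff(is.na(x)) == 1
length(unpadded)
#[1] 7
which(unpadded)
#[1] 4

# With a non-NA value in front, the result has length(x) elements and
# the leading NA run is detected as a start.
padded <- diff(is.na(c(1, x))) == 1
length(padded)
#[1] 8
which(padded)
#[1] 1 5
```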
Intermediate results displayed:
a <- is.na(c(1, x))
a
#[1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
b <- diff(a) == 1
b
#[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
d <- cumsum(b)
d
#[1] 1 1 1 1 2 2 2 2
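The last step, multiplying by is.na(x), is not shown in the intermediate results. As a sketch, here are the same steps on a different vector, which starts with a non-NA value and ends inside an NA run. The vector y and the names na_y, starts and group_id are made up for this illustration.

```r
# Hypothetical example vector, not from the question
y <- c(5, NA, NA, 7, NA, 8, NA, NA)

na_y <- is.na(y)

# TRUE where an NA run begins (non-NA followed by NA)
starts <- diff(is.na(c(1, y))) == 1

# Running count of NA runs seen so far
cumsum(starts)
#[1] 0 1 1 1 2 2 3 3

# Zero out the non-NA positions
group_id <- cumsum(starts) * na_y
group_id
#[1] 0 1 1 0 2 0 3 3
```

Positions before the first NA run get 0 from the cumsum already; the multiplication only matters for non-NA values that come after a run.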
I was interested so I did a benchmark. I don't think the results matter practically though: the differences are on the order of a few hundred milliseconds even for length(x) of 1e7.
library(data.table)  # provides rleid(), used in f_rleid() below

x <- c(NA, NA, 1, 2, NA, NA, 3, 4)
x <- sample(x, 1e7, replace = TRUE)
f_rleid <- function(x) {
  nax <- is.na(x)
  r <- rleid(x) * nax
  r[nax] <- rleid(r[nax])
  r
}

f_rle <- function(x) {
  r <- rle(is.na(x))
  r$values <- cumsum(r$values) * r$values
  inverse.rle(r)
}

f_diffna <- function(x) {
  nax <- is.na(x)
  cumsum(c(as.integer(nax[1]), diff(nax)) == 1L) * nax
}
all.equal(f_rleid(x), f_rle(x))
# [1] TRUE
all.equal(f_rleid(x), f_diffna(x))
# [1] TRUE
microbenchmark::microbenchmark(f_rleid(x), f_rle(x), f_diffna(x))
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#   f_rleid(x) 421.9483 437.3314 469.3564 446.5081 511.9315 582.5812   100
#     f_rle(x) 451.3790 519.5278 560.8057 572.4148 591.7632 697.2100   100
#  f_diffna(x) 248.3631 267.5462 315.6224 291.5910 362.8829 459.6873   100