rle is your friend:
v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
rv <- rle(v)
unlist(sapply(seq_along(rv$lengths), function(i)
c(rv$values[i], rep(NA, rv$lengths[i]-1))))
## [1] 1 NA 5 NA 2 NA 6 NA 1 2 NA NA NA NA NA 3
Explanation: rle returns a list consisting of 2 vectors, lengths and values:
unclass(rv)
## $lengths
## [1] 2 2 2 2 1 6 1
##
## $values
## [1] 1 5 2 6 1 2 3
from which we may create the result. The first value, 1, occurs 2 times in the input vector. So in the output we want 1 followed by 2-1 NAs. Then 5 occurs 2 times, so we get 5, NA, and so on.
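The same construction can be written without calling a function once per run. This is a minimal sketch in base R, and the name rle_mask is made up for illustration: every position starts out as NA, and each run's value is then written at the index where that run begins.

```r
# rle_mask is a hypothetical helper name, not part of the answer above.
rle_mask <- function(v) {
  rv <- rle(v)
  # Start index of each run: 1, then one past the end of every earlier run.
  starts <- head(c(1L, cumsum(rv$lengths) + 1L), -1L)
  out <- v
  out[] <- NA               # keep the type of v, blank every position
  out[starts] <- rv$values  # restore the first element of each run
  out
}

v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
rle_mask(v)
```

For the example vector the run starts are 1, 3, 5, 7, 9, 10 and 16, which are exactly the positions that keep their value in the output shown earlier.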
EDIT: However, this solution is quite slow compared with the others listed:
set.seed(123L)
v <- sample(1:5, 10000, replace=TRUE)
library(microbenchmark)
microbenchmark(...)
## Unit: milliseconds
## min lq median uq max neval
## @Stephan 1.161341 1.193744 1.230734 1.248493 5.867357 100
## @beginneR 2.568235 2.618651 2.655130 3.034742 8.837571 100
## @gagolews 102.307481 111.128368 117.279179 121.308154 195.238260 100
EDIT2: As my really slow rle-based solution got accepted, here's an Rcpp-based solution for speed lovers:
library(Rcpp)
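cppFunction("
  NumericVector duptrack(NumericVector v) {
    int n = v.size();
    NumericVector out(Rcpp::clone(v));  // deep copy, so the input is not modified
    for (int i = 1; i < n; ++i) {
      // an element equal to its predecessor is a repeat within a run
      if (v[i] == v[i-1])
        out[i] = NA_REAL;
    }
    return out;
  }
")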
Benchmarks:
## Unit: milliseconds
## min lq median uq max neval
## @gagolews-Rcpp 0.077296 0.080160 0.0832595 0.089952 2.31203 100
## @Stephan 1.161027 1.167035 1.1759645 1.223393 6.21994 100
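The Rcpp loop compares each element with its predecessor and blanks it when the two are equal. For readers without a C++ toolchain, the same comparison can be sketched in plain R. The name duptrack_r is hypothetical, this is not claimed to be any of the other answers' code, and no timing is implied for it.

```r
# duptrack_r is a hypothetical name for a plain-R analogue of the loop above.
duptrack_r <- function(v) {
  n <- length(v)
  out <- v
  if (n < 2L) return(out)
  # TRUE where an element equals the one before it; the first element never does.
  same_as_prev <- c(FALSE, v[-1L] == v[-n])
  # which() drops NA comparisons, so only definite repeats are blanked.
  out[which(same_as_prev)] <- NA
  out
}

v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
duptrack_r(v)
```

Wrapping the comparison in which() is a deliberate choice: a comparison involving NA yields NA rather than TRUE or FALSE, and which() keeps only the TRUE positions.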
EDIT3: As with all R code, we should also be interested in how the solutions deal with vectors that contain missing values.
For v <- c(1,1,NA,2,NA,2,2) we get:
1 NA NA 2 NA 2 NA     -- @gagolews
1 NA NA 2 NA 2 NA     -- @Stephan
NA NA NA NA NA NA NA  -- @beginneR with stats::lag
1 NA NA 2 NA 2 NA     -- @beginneR with dplyr::lag
1 NA NA 2 NA 2 NA     -- @gagolews-Rcpp
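To see why the rle-based version behaves this way on missing values, it may help to inspect the runs directly. A small sketch in base R, where v_na and rv_na are illustrative names: rle puts every NA in a run of its own, so each NA is simply carried through as the "value" of a length-1 run.

```r
v_na <- c(1, 1, NA, 2, NA, 2, 2)
rv_na <- rle(v_na)

# Each NA forms its own run of length 1.
unclass(rv_na)

# Rebuilding the output exactly as in the rle-based solution.
res <- unlist(sapply(seq_along(rv_na$lengths), function(i)
  c(rv_na$values[i], rep(NA, rv_na$lengths[i] - 1))))
res
```

Here the run lengths come out as 2, 1, 1, 1, 2, so only the second 1 and the final 2 are replaced; the NAs already present in the input stay where they were.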