3

If you have a dataframe like so:

v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
w <- data.frame(v)

How can you replace the repeated values in w with NA, but only where a value immediately repeats the one before it, so that your new data frame looks like this?

v <- c(1, NA, 5, NA, 2, NA, 6, NA, 1, 2, NA, NA, NA, NA, NA, 3)
w <- data.frame(v)

Note that 2 appears in two separate runs: the first value of each run is retained, and only the immediately repeating values after it are replaced with NA.

I've searched SO, and the answers I'm finding remove every repeated value using the unique and duplicated functions, which is not what I'm looking for. I'm hoping there is a package in R that can do this without my having to write a function.

Jaap
Reuben Mathew

4 Answers

6

The key is to check differences using diff() and to fill with NA whenever a difference is zero:

> result <- v
> result[c(FALSE,diff(v)==0)] <- NA
> result
 [1]  1 NA  5 NA  2 NA  6 NA  1  2 NA NA NA NA NA  3
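Since the question is about a data frame column, the diff() idea above can be wrapped up roughly as follows. This is only a sketch: the helper name na_consecutive_dups is hypothetical (not from any package), and which() is added so that NA comparisons are simply skipped.

```r
# Hypothetical helper: set a value to NA when it equals the element right before it.
na_consecutive_dups <- function(x) {
  if (length(x) < 2) return(x)          # nothing can repeat in a 0/1-length vector
  repeated <- c(FALSE, diff(x) == 0)    # the first element is never a repeat
  x[which(repeated)] <- NA              # which() drops NA comparisons
  x
}

w <- data.frame(v = c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3))
w$v <- na_consecutive_dups(w$v)
w$v
```

The column stays numeric, because a logical NA assigned into a numeric vector is coerced to a numeric NA.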
Stephan Kolassa
5

Or a simple ifelse:

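ifelse(dplyr::lag(v, 1) == v & !is.na(dplyr::lag(v, 1)), NA, v)
#[1]  1 NA  5 NA  2 NA  6 NA  1  2 NA NA NA NA NA  3

Edit: this needs dplyr::lag, not stats::lag. stats::lag does not shift the values of a plain vector (it only adjusts the time-series attributes), so with it every element comes out as NA.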
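For readers without dplyr, the same comparison can be sketched in base R by shifting the vector by hand. The names prev and same are illustrative only, and the explicit is.na() checks are an assumption about how missing values should be treated (an NA is never counted as a repeat).

```r
v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)

# Previous element of each position; the first position has no predecessor.
prev <- c(NA, head(v, -1))

# TRUE only where both values are present and equal.
same <- !is.na(prev) & !is.na(v) & prev == v

result <- ifelse(same, NA, v)
result
```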

talat
  • Thank you for your time and efforts sir. Now I have learned three new techniques for R. This is more than I could have asked for. – Reuben Mathew May 23 '14 at 12:03
  • @ReubenMathew very welcome. I also find it very interesting to see the variety and number of different ways to do the same operation in R – talat May 23 '14 at 12:06
  • @gagolews thanks, also for editing your question to reflect the difference – talat May 23 '14 at 13:07
4

rle is your friend:

v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
rv <- rle(v)
unlist(sapply(seq_along(rv$lengths), function(i)
   c(rv$values[i], rep(NA, rv$lengths[i]-1))))
## [1]  1 NA  5 NA  2 NA  6 NA  1  2 NA NA NA NA NA  3

Explanation: rle returns a list consisting of 2 vectors, lengths and values:

unclass(rv)
## $lengths
## [1] 2 2 2 2 1 6 1
## 
## $values
## [1] 1 5 2 6 1 2 3

from which we may create the result. The first value, 1, occurs 2 times in the input vector. So in the output we want 1 and 2-1 NAs to follow. Then 5 occurs 2 times, so we get 5, NA, and so on.
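The same run-length information can also be used without looping over the runs. The variant below is a sketch, not the code from this answer: it computes where each run starts and writes the run's value only there. The names starts and out are illustrative, and no timing is claimed for it.

```r
v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
rv <- rle(v)

# Index of the first element of each run: 1, then 1 + cumulative run lengths.
starts <- cumsum(c(1, head(rv$lengths, -1)))

# Start with all NA and fill in only the first position of every run.
out <- rep(NA_real_, length(v))
out[starts] <- rv$values
out
```

For this input the run starts are positions 1, 3, 5, 7, 9, 10 and 16, which are exactly the non-NA positions in the desired output.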

EDIT: However, this solution is quite slow compared with the other answers listed:

set.seed(123L)
v <- sample(1:5, 10000, replace=TRUE)
library(microbenchmark)
microbenchmark(...)
## Unit: milliseconds
##                  min         lq     median         uq        max neval
## @Stephan    1.161341   1.193744   1.230734   1.248493   5.867357   100
## @beginneR   2.568235   2.618651   2.655130   3.034742   8.837571   100
## @gagolews 102.307481 111.128368 117.279179 121.308154 195.238260   100

EDIT2: As my really slow rle-based solution got accepted, here's an Rcpp-based solution for speed lovers:

library(Rcpp)
cppFunction("
   NumericVector duptrack(NumericVector v) {
      int n = v.size();
      NumericVector out(Rcpp::clone(v));
      for (int i=1; i<n; ++i)
         if (v[i] == v[i-1])
            out[i] = NA_REAL;
      return out;
   }
")

Benchmarks:

## Unit: milliseconds
##                              min       lq    median       uq     max  neval
## @gagolews-Rcpp          0.077296 0.080160 0.0832595 0.089952 2.31203    100
## @Stephan                1.161027 1.167035 1.1759645 1.223393 6.21994    100

EDIT3: As with all R code, we should also check how the solutions deal with vectors that contain missing values.

For v <- c(1,1,NA,2,NA,2,2) we get:

  • 1 NA NA 2 NA 2 NA -- @gagolews
  • 1 NA NA 2 NA 2 NA -- @Stephan
  • NA NA NA NA NA NA NA -- @beginneR with stats::lag
  • 1 NA NA 2 NA 2 NA -- @beginneR with dplyr::lag
  • 1 NA NA 2 NA 2 NA -- @gagolews-Rcpp
gagolews
  • I need to wait 8 minutes to give your answer a check mark, but I tried it out and it works perfectly! Thanks! I don't understand why it works, but that's for me to do some homework on your answer. Thanks again! – Reuben Mathew May 23 '14 at 11:52
  • @ReubenMathew, the other answers perform faster, benchmarks will follow soon. – gagolews May 23 '14 at 11:55
  • Thank you kindly for the explanation. I appreciate your time and efforts. – Reuben Mathew May 23 '14 at 11:57
  • +1 for adding benchmarks! (Sorry, I can't bring myself to upvote the `rle` construct as such, although something like that is the first thing I tried, too... I find it too hard to parse to be useful, thinking of all the times I had to reactivate code after six months.) – Stephan Kolassa May 23 '14 at 11:59
  • @gagolews (-1) sorry, but your last edit is not correct about my solution. i just tested it. the result is exactly as the others – talat May 23 '14 at 12:53
  • @beginneR: Strange... I'm really getting NAs only (see: http://rexamine.com/manual_upload/so_23828369.png). BTW, no offence. – gagolews May 23 '14 at 12:56
  • @gagolews ok, I found out why. I have `dplyr` loaded in my library and it uses `dplyr::lag` for the operation. with base R `lag` it produces NA as you described (I'll remove the downvote, but perhaps you make a note of this in your comparison). ( i can only remove the downvote if you edit the answer again) – talat May 23 '14 at 12:59
0

You can go like this :

v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)
x<-c(0,v[1:(length(v)-1)])
v[(v-x)==0]<-'NA'
w<-data.frame(v)
gagolews
  • Hmm... it returns a character vector. You probably meant `NA` instead of `"NA"`. BTW, basically this is the same as @StephanKolassa's solution.. – gagolews May 23 '14 at 13:10
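
Following that comment, a corrected version of this last answer might look like the sketch below. It assigns the logical constant NA rather than the string 'NA', so the column stays numeric. Padding the shifted vector with NA instead of 0 is an additional assumption made here, so that a vector starting with 0 does not lose its first element.

```r
v <- c(1, 1, 5, 5, 2, 2, 6, 6, 1, 2, 2, 2, 2, 2, 2, 3)

# Shift v right by one position; NA marks "no previous element".
x <- c(NA, head(v, -1))

# which() skips positions where the comparison is NA.
v[which(v == x)] <- NA

w <- data.frame(v)
class(w$v)   # "numeric"
```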