6

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive. For example, for the following data frame:

df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)

x  y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9

I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:

x  y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9

Following directions from help and some other posts, I have tried using the duplicated function:

df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
      x  y  z
1     1 10  1
6     4 13  6
7     2 12  7
9     1 30  9
NA   NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA

Not sure why I get the NA rows at the end (wasn't happening with a similar table I was testing), but works only partially on the values.

I have also tried using the data.table package as follows:

library(data.table)
dt <- as.data.table(df)           
setkey(dt, x)                    
dt[J(unique(x)), mult ='last'] 

Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:

x  y z
1 30 9
2 49 8
4 13 6

Please, forgive if cross-posting. I tried some of the suggestions but none worked for eliminating only those that are consecutive. I would appreciate any help.

Thanks

ebb
  • 274
  • 2
  • 6
  • 16
  • 2
    Maybe `inx <- with(df, c(FALSE, diff(x) != 0)); df[c(diff(cumsum(inx)) != 0, TRUE), ]`. (I am not sure that I like it, one too many `diff`.) – Rui Barradas Mar 15 '18 at 18:44
  • @Rui, maybe not elegant but does the job, thanks! what does the `c(FALSE, diff(x) !=0)` do? Sorry, I'm a newbie and trying to learn. I understand if you don't have the time to elaborate. Thanks – ebb Mar 15 '18 at 19:12

4 Answers4

7

How about:

df[cumsum(rle(df$x)$lengths),]

Explanation:

rle(df$x)

gives you the run lengths and values of consecutive duplicates in the x variable. Then:

rle(df$x)$lengths

extracts the lengths. Finally:

cumsum(rle(df$x)$lengths)

gives the row indices which you can select using [.

EDIT for fun here's a microbenchmark of the answers given so far with rle being mine, consec being what I think is the most fundamentally direct answer, given by @James, and would be the answer I would "accept", and dp being the dplyr answer given by @Nik.

#> Unit: microseconds
#>    expr       min         lq       mean     median         uq        max
#>     rle   134.389   145.4220   162.6967   154.4180   172.8370    375.109
#>  consec   111.411   118.9235   136.1893   123.6285   145.5765    314.249
#>      dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213

rle performs better than I thought it would.

ngm
  • 2,539
  • 8
  • 18
  • thanks! I've seen `rle` being used before for other purposes. Some people says it may fail with large data sets, but it works here! – ebb Mar 15 '18 at 19:14
  • 1
    My computer was perfectly happy to run `rle(sample(1:10000, 1e8, repl=TRUE))` in about 10 seconds. So I don't think `rle` is going to be the issue. – ngm Mar 15 '18 at 19:29
  • thanks for the info on performance, and for the explanations, I learned a lot. – ebb Mar 15 '18 at 20:54
  • Thank you for the `microbenchmark`. An eye opener for me – Nik Muhammad Naim Mar 16 '18 at 02:32
6

You just need to check in there is no duplicate following a number, i.e x[i+1] != x[i] and note the last value will always be present.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
  x  y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
James
  • 65,548
  • 14
  • 155
  • 193
  • I tried doing something similar `df[!(x[i] == x[i+1])` but I got lost in the syntax. Neat and simple, thanks – ebb Mar 15 '18 at 20:15
2

A cheap solution with dplyr that I could think of:

Method:

library(dplyr)
df %>% 
  mutate(id = lag(x, 1), 
         decision = if_else(x != id, 1, 0), 
         final = lead(decision, 1, default = 1)) %>% 
  filter(final == 1) %>% 
  select(-id, -decision, -final)

Output:

  x  y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9

This will even work if your data has the same x value at the bottom

New Input:

df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2

   x  y  z
1  1 10  1
2  1 11  2
3  1 30  3
4  2 12  4
5  2 49  5
6  4 13  6
7  2 12  7
8  2 49  8
9  1 30  9
10 1 10 12

Use same method:

df2 %>% 
  mutate(id = lag(x, 1), 
         decision = if_else(x != id, 1, 0), 
         final = lead(decision, 1, default = 1)) %>% 
  filter(final == 1) %>% 
  select(-id, -decision, -final)

New Output:

  x  y  z
1 1 30  3
2 2 49  5
3 4 13  6
4 2 49  8
5 1 10 12
  • I didn't think on using `dplyr::` like this. It'll be very helpful, thanks. Some unrelated question, and sorry in advance for the simplicity: what does `%>%` do? I've seen it quite often and I don't seem to be able to find an answer for this. – ebb Mar 15 '18 at 20:23
1

Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x

library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")] # edited to add closing )

This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x in order for correct handling of the last value.

ngm
  • 2,539
  • 8
  • 18
Calbers
  • 399
  • 3
  • 4
  • I was curious on how to achieve this with `data.table::`, thanks. This seems like the `data.table::` vs of the `dplyr::` response by @ Nik. Never thought of shifting/lagging the column. Is a nice roundabout to keep in mind. – ebb Mar 15 '18 at 20:44
  • 1
    I'd say this is the same logic as @James. Another `data.table` solution using run length encoding (i.e. `rle`) would be `dattab[, grp:=rleid(x)][,.SD[.N], by=grp][, -"grp"]`. What this does is: add a variable called `grp` based on the run length group. Then pick the last line within each group. Then delete the `grp` variable. – ngm Mar 16 '18 at 13:30
  • Same logic, yes. I have to admit I did not read his code carefully enough before I posted. I like your solution, though. Did not know `rleid`. – Calbers Mar 16 '18 at 18:35