Remove consecutive duplicates from dataframe

Question

I have a data frame from which I want to remove consecutive duplicate rows, using base R. I know rle may be helpful here, but I can't think of how to use it. The example output should illustrate what I'm asking for.

Generate sample data:

set.seed(12)
samps <- sample(1:5, 20, T)
dat <- data.frame(v1=LETTERS[samps], v2=month.abb[samps])
dat[10, 2] <- "Mar"

Sample data:

   v1  v2
1   A Jan
2   E May
3   E May
4   B Feb
5   A Jan
6   A Jan
7   A Jan
8   D Apr
9   A Jan
10  A Mar
11  B Feb
12  E May
13  B Feb
14  B Feb
15  B Feb
16  C Mar
17  C Mar
18  C Mar
19  D Apr
20  A Jan
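For anyone reproducing this on a newer R: the default sampling algorithm changed in R 3.6.0 and the stringsAsFactors default changed in R 4.0.0, so the seeded code above may not return the rows printed here. A minimal sketch that rebuilds the same data frame directly from the printed values (the index vector is transcribed by hand from the table, not taken from the original session):

```r
# Index values transcribed from the printed sample data, so no RNG is involved
samps <- c(1, 5, 5, 2, 1, 1, 1, 4, 1, 1, 2, 5, 2, 2, 2, 3, 3, 3, 4, 1)

# stringsAsFactors = TRUE reproduces the pre-4.0 default of factor columns
dat <- data.frame(v1 = LETTERS[samps], v2 = month.abb[samps],
                  stringsAsFactors = TRUE)
dat[10, 2] <- "Mar"  # "Mar" is already a level of v2, so no NA is introduced
```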

Desired outcome:

   v1  v2
1   A Jan
3   E May
4   B Feb
7   A Jan
8   D Apr
10  A Mar
11  B Feb
12  E May
15  B Feb
18  C Mar
19  D Apr
20  A Jan

It appears that one could filter on a single column for your example, but is that the intent? — Matthew Lundberg, Dec 27 '12 at 14:45
No, this is not the intent; I made the two columns identical just for convenience. I edited to reflect this. — Tyler Rinker, Dec 27 '12 at 15:10

Matthew Plourde · Accepted Answer · 2012-12-27T14:41:05.610

9

Here's a way that doesn't use rle, but works nonetheless:

dat[with(dat, c(TRUE, diff(as.numeric(interaction(v1, v2))) != 0)), ]

This assumes you're using factor columns, as your sample data implies.
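To make the one-liner easier to follow, here is a step-by-step sketch of the same idea. The data frame is rebuilt by hand from the printed sample data, and the intermediate names (key, keep_first, keep_last) are just illustrative. The keep_last variant is an addition for cases where the last row of each run is wanted, as in the row names of the desired output.

```r
# Sample data rebuilt from the printed table (factor columns, as assumed above)
samps <- c(1, 5, 5, 2, 1, 1, 1, 4, 1, 1, 2, 5, 2, 2, 2, 3, 3, 3, 4, 1)
dat <- data.frame(v1 = LETTERS[samps], v2 = month.abb[samps],
                  stringsAsFactors = TRUE)
dat[10, 2] <- "Mar"

# One integer code per distinct (v1, v2) combination
key <- as.numeric(interaction(dat$v1, dat$v2))

# TRUE where a row differs from the row before it; the first row is always kept
keep_first <- c(TRUE, diff(key) != 0)
first_of_run <- dat[keep_first, ]

# Shifting the comparison by one keeps the last row of each run instead
keep_last <- c(diff(key) != 0, TRUE)
last_of_run <- dat[keep_last, ]
```

Because both columns go into the key, row 9 (A Jan) is kept: it differs from row 10 (A Mar) in v2.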

edited Dec 27 '12 at 14:41

answered Dec 27 '12 at 14:34

Matthew Plourde

score 4 · Answer 2 · answered Dec 27 '12 at 15:33

Here's a fast solution using filter:

dat[(filter(dat,c(-1,1))!= 0)[,1],]
     v1   v2
1     A  Jan
3     E  May
4     B  Feb
7     A  Jan
8     D  Apr
10    A  Mar
11    B  Feb
12    E  May
15    B  Feb
18    C  Mar
19    D  Apr
NA <NA> <NA>

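The trailing NA row appears because the filter has no following value to compare against at the last position, so you need to add the last row of the original data to the result. Note that [,1] selects only the first column of the filtered values, so only v1 is compared here.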
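One possible way to handle that last row, and to compare every column rather than only the first, is sketched below. It assumes the sample data as printed in the question (rebuilt by hand here); the names flt, diffs and changed are illustrative. The stats:: prefix is only there in case another package such as dplyr masks filter.

```r
# Sample data rebuilt from the printed table (factor columns)
samps <- c(1, 5, 5, 2, 1, 1, 1, 4, 1, 1, 2, 5, 2, 2, 2, 3, 3, 3, 4, 1)
dat <- data.frame(v1 = LETTERS[samps], v2 = month.abb[samps],
                  stringsAsFactors = TRUE)
dat[10, 2] <- "Mar"

# Each filtered value is x[i] - x[i + 1] on the factor codes, so the last row is NA
flt <- stats::filter(dat, c(-1, 1))

# Drop the time-series attributes and restore a plain 20 x 2 numeric matrix
diffs <- matrix(as.numeric(flt), nrow = nrow(dat))

# A row is kept if any column differs from the following row
changed <- rowSums(diffs != 0) > 0

# The last row has nothing after it, so it always ends a run
changed[nrow(dat)] <- TRUE

res <- dat[changed, ]
```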

answered Dec 27 '12 at 15:33

agstudy

Thoughtful response, thank you for your work on the problem. I rarely think about filter, so seeing it used was informative. – Tyler Rinker Dec 28 '12 at 01:31

adibender · Answer 3 · 2012-12-27T14:52:56.497

3

Using rle, I came up with this:

ind <- cumsum(rle(as.character(dat$v1))$lengths)
dat[ind, ]

ind holds the index of the last row in each run of consecutive entries.

EDIT:

A simple fix for the issue raised in Matthew's comment would be:

dat[15, 2] <- "May"
dat[cumsum(rle(paste0(dat$v1, dat$v2))$lengths), ]
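For reference, a slightly more defensive sketch of the same rle idea, run on the sample data as printed in the question (without the dat[15, 2] change above). The separator is an assumption: "|" is taken not to occur in the data, and it guards against two different rows pasting to the same string. The names key, runs, last_idx and first_idx are illustrative.

```r
# Sample data rebuilt from the printed table
samps <- c(1, 5, 5, 2, 1, 1, 1, 4, 1, 1, 2, 5, 2, 2, 2, 3, 3, 3, 4, 1)
dat <- data.frame(v1 = LETTERS[samps], v2 = month.abb[samps],
                  stringsAsFactors = TRUE)
dat[10, 2] <- "Mar"

# One string per row; the separator is assumed not to appear in the data
key <- paste(dat$v1, dat$v2, sep = "|")

runs <- rle(key)

# Cumulative run lengths give the position of the last row in each run
last_idx <- cumsum(runs$lengths)

# Stepping back by the run length gives the first row of each run
first_idx <- last_idx - runs$lengths + 1

last_of_run <- dat[last_idx, ]
first_of_run <- dat[first_idx, ]
```

Both results contain the same values and differ only in which row name represents each run.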

edited Dec 27 '12 at 14:52

answered Dec 27 '12 at 14:40

adibender

This works on OP's sample data, but would fail if there were, say, two consecutive rows, 'E Feb' and 'E May'. – Matthew Plourde Dec 27 '12 at 14:45
You're right! I assumed the two columns would always have the same values. – adibender Dec 27 '12 at 14:47
Thank you for your work. @Matthew's is a little more generalizable. – Tyler Rinker Dec 27 '12 at 15:13
