7

I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I would need to remove. In particular, I only want to remove row-adjacent duplicates, but keep the rest. For example, suppose I had the data frame:

df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"), 
                y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

This results in the following data frame

x   y
A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   8
B   9
C   10

In this case, I expect there to be repeating "A, B, C, A, B, C, etc.". However, it is only a problem if I see adjacent row duplicates. In my example above, that would be rows 8 and 9 with the duplicate "B" being adjacent to each other.

In my data set, whenever this occurs, the first instance is always a user-error, and the second is always the correct version. In very rare cases, there might be an instance where the duplicates occur 3 (or more) times. However, in every case, I would always want to keep the last occurrence. Thus, following the example from above, I would like the final data set to look like

A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   9
C   10

Is there an easy way to do this in R? Thank you in advance for your help!


Edit: 11/19/2014 12:14 PM EST There was a solution posted by user Akron (spelling?) that has since gotten deleted. I am now sure why because it seemed to work for me?

The solution was

df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]

It seems to work for me, why did it get deleted? For example, in cases with more than 2 consecutive duplicates:

df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   x  y
1  A  1
2  B  2
3  B  3
4  B  4
5  C  5
6  C  6
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14

> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
   x  y
1  A  1
4  B  4
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
13 B 13
14 C 14

This seems to work?

Vincent
  • 7,808
  • 13
  • 49
  • 63

3 Answers3

7

Try

 df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
#   x  y
#1  A  1
#2  B  2
#3  C  3
#4  A  4
#5  B  5
#6  C  6
#7  A  7
#9  B  9
#10 C 10

Explanation

Here, we are comparing an element with the element preceding it. This can be done by removing the first element from the column and that column compared with the column from which last element is removed (so that the lengths become equal)

 df$x[-1] #first element removed
 #[1] B C A B C A B B C
 df$x[-nrow(df)]
  #[1] A B C A B C A B B #last element `C` removed

 df$x[-1]!=df$x[-nrow(df)]
 #[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

In the above, the length is 1 less than the nrow of df as we removed one element. Inorder to compensate that, we can concatenate a TRUE and then use this index for subsetting the dataset.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • This solution works, but I am confused as to why it works. I looked up the help page for with, but I am still unsure. I apologize in advance, I am a novice R user. – Vincent Nov 19 '14 at 19:18
  • If you wanted to apply this to x as a vector instead of a column in a df, you can use: `x[c(x[-1]!= x[-length(x)], TRUE)]` – Jeffrey Girard May 05 '22 at 15:27
5

Here's an rle solution:

df[cumsum(rle(as.character(df$x))$lengths), ]
#    x  y
# 1  A  1
# 2  B  2
# 3  C  3
# 4  A  4
# 5  B  5
# 6  C  6
# 7  A  7
# 9  B  9
# 10 C 10

Explanation:

RLE stands for Run Length Encoding. It produces a list of vectors. One being the runs, the values, and the other lengths being the number of consecutive repeats of each value. For example, x <- c(3, 2, 2, 3) has a runs vector of c(3, 2, 3) and lengths c(1, 2, 1). In this example, the cumulative sum of the lengths produces c(1, 3, 4). Subset x with this vector and you get c(3, 2, 3). Note that the second element of the lengths vector is the third element of the vector and the last occurrence of 2 in that particular 'run'.

blakeoft
  • 2,370
  • 1
  • 14
  • 15
2

You could also try

df[c(diff(as.numeric(df$x)), 1) != 0, ]

In case x is of character class (rather than factor), try

df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
#    x  y
# 1  A  1
# 2  B  2
# 3  C  3
# 4  A  4
# 5  B  5
# 6  C  6
# 7  A  7
# 9  B  9
# 10 C 10
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • I think you need to pad your diff:s, like `c(1, diff(as.integer(as.factor(df$x))))`. Check e.g. `x <- c("a", "a", "b")`; `x[diff(as.numeric(factor(x))) != 0]` – Henrik Nov 19 '14 at 17:27
  • @Henrik, your method won't reach desired result. Compare on the newly provided data set – David Arenburg Nov 19 '14 at 17:32
  • @David Arenburg Suppose if the data is (based on the edited dataset) `set.seed(14); df1 <- df[sample(1:nrow(df)),];row.names(df1) <- NULL; df1[diff(as.numeric(factor(df1$x))) != 0, ]` is not giving the last row. Perhaps, padding is needed. – akrun Nov 19 '14 at 17:46
  • 1
    @akrun, good catch. It seems that if the dupe is at the start, then padding is needed. Though the padding should be added to the end, rather to the beginning (like in Henrik example). Edited. – David Arenburg Nov 19 '14 at 17:52
  • That's why I asked you to check your code on `x <- c("a", "a", "b")` ;) – Henrik Nov 19 '14 at 20:37