After searching for a while, I could not find an existing answer to this question. Suppose I have the following vector:

v <- c("a", "b", "b", "c","c","c", "d", "d", "d", "d")

How do I find the values that have more than one duplicate

(should be "c","c","c", "d", "d", "d", "d")

and more than two duplicates

(should be "d", "d", "d", "d")?

The function duplicated(v) only flags elements that have already appeared earlier in the vector; it does not tell me how many times each value occurs.

Duy Bui

2 Answers

You can build a frequency table with table() and then keep the elements of v whose values appear in the relevant subset of that table, e.g.

R> v <- c("a", "b", "b", "c","c","c", "d", "d", "d", "d")
R> tab <- table(v)
R> tab
v
a b c d 
1 2 3 4 
R> v[v %in% names(tab[tab > 2])]
[1] "c" "c" "c" "d" "d" "d" "d"
R> v[v %in% names(tab[tab > 3])]
[1] "d" "d" "d" "d"
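
The steps above can be packaged into a small helper. This is only a minimal sketch: the function name `more_than_n_dups` and its arguments `x` and `n` are hypothetical, and it assumes that "more than `n` duplicates" means a value occurs more than `n + 1` times, as in the question.

```r
# Hypothetical helper: keep every element of `x` whose value occurs
# more than `n + 1` times, i.e. has more than `n` duplicates.
more_than_n_dups <- function(x, n) {
  tab <- table(x)                       # counts per distinct value
  frequent <- names(tab[tab > n + 1])   # values with enough occurrences
  x[x %in% frequent]                    # keep original order and repeats
}

v <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d")
more_than_n_dups(v, 1)
# [1] "c" "c" "c" "d" "d" "d" "d"
more_than_n_dups(v, 2)
# [1] "d" "d" "d" "d"
```

If no value is frequent enough, the subset of the table is empty and the function returns a zero-length vector.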
Achim Zeileis
  • Because this question came up: In case `v` is numeric you may want an additional `as.numeric(names(...))` to transform the table names back to numeric. But as pointed out in the comment below...it appears to work even without :-) – Achim Zeileis Apr 30 '15 at 16:44
  • Actually just realized that what I said was false. Surprisingly `1 %in% c("1", "2")` returns `TRUE`, so `as.numeric()` isn't needed even if `v` is numeric. – Alex A. Apr 30 '15 at 16:46
  • Note that if the OP simply wants a list of duplicated elements rather than a repeated list of duplicated elements, you can simply use `names(tab[tab > 1])` (likewise for 2, 3, ...) Or you can just wrap `unique()` around what you already have. (Waiting for the OP's confirmation on the desired form of the output.) – Alex A. Apr 30 '15 at 16:51
  • That's not the way I read the question...but as you say: it's easy to tweak the example if necessary. – Achim Zeileis Apr 30 '15 at 16:56
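
The two points raised in the comments can be checked directly. The following is a small sketch using a made-up numeric vector `x` (not from the question): it shows that `%in%` matches numeric values against the character names of the table, and how to get each repeated value only once.

```r
x <- c(1, 2, 2, 3, 3, 3)   # illustrative numeric vector, not from the question
tab <- table(x)

# %in% coerces both sides to a common type, so the numeric values
# match the character table names without as.numeric().
x[x %in% names(tab[tab > 2])]
# [1] 3 3 3

# Distinct values only, rather than every repeated element.
names(tab[tab > 1])
# [1] "2" "3"
unique(x[x %in% names(tab[tab > 1])])
# [1] 2 3
```

Note that `names(tab[...])` always returns character strings, whereas wrapping `unique()` around the subset of `x` keeps the original type.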

I would use ave to write a simple function like this:

myFun <- function(vector, thresh) {
  ind <- ave(rep(1, length(vector)), vector, FUN = length)
  vector[ind > thresh + 1] ## added "+1" to match your terminology
}

Here it is applied to "v":

myFun(v, 1)
# [1] "c" "c" "c" "d" "d" "d" "d"
myFun(v, 2)
# [1] "d" "d" "d" "d"
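
To see why `myFun` works, it may help to look at the intermediate vector that `ave` produces. In this sketch the variable name `counts` is only illustrative; it holds, for each element of `v`, the number of times that element's value occurs.

```r
v <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d")

# For each element, the size of the group of equal values it belongs to.
counts <- ave(rep(1, length(v)), v, FUN = length)
counts
# [1] 1 2 2 3 3 3 4 4 4 4

# "More than 1 duplicate" means the value occurs more than 2 times.
v[counts > 1 + 1]
# [1] "c" "c" "c" "d" "d" "d" "d"
```

Because `counts` is aligned with `v`, a plain logical comparison is enough to subset the original vector.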

Of course, there is always "data.table":

library(data.table)
as.data.table(v)[, N := .N, by = v][N > 1 + 1]$v
# [1] "c" "c" "c" "d" "d" "d" "d"
as.data.table(v)[, N := .N, by = v][N > 2 + 1]$v
# [1] "d" "d" "d" "d"
A5C1D2H2I1M1N2O1R2T1