remove duplicate id based on certain criteria

Question

id <-  c(1,1,2,3,4,4,5,6,7,7,7,8,9)
age <- c(10,10.6,11,11.3,10.9,11.4,10.7,11,10.5,11.1,12.3,10.3,10.7)
ageto11 <- abs(age-11)

df <- as.data.frame(cbind(id,age,ageto11))
df
   id  age ageto11
1   1 10.0     1.0
2   1 10.6     0.4
3   2 11.0     0.0
4   3 11.3     0.3
5   4 10.9     0.1
6   4 11.4     0.4
7   5 10.7     0.3
8   6 11.0     0.0
9   7 10.5     0.5
10  7 11.1     0.1
11  7 12.3     1.3
12  8 10.3     0.7
13  9 10.7     0.3

I am trying to remove the duplicated ids in the above data frame, keeping for each id only the row with the smallest distance to age 11 (i.e. the smallest value of ageto11).

For example, when id=1, I would like to remove the first row, in which ageto11 is larger. When id=7, I would like to keep the 10th row, in which ageto11 is the smallest.

The desired result is:

   id  age ageto11
2   1 10.6     0.4
3   2 11.0     0.0
4   3 11.3     0.3
5   4 10.9     0.1
7   5 10.7     0.3
8   6 11.0     0.0
10  7 11.1     0.1
12  8 10.3     0.7
13  9 10.7     0.3

akrun · Accepted Answer · 2015-09-01T03:54:33.877

We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'id', compute the absolute difference between 'age' and 11, find the index of its minimum within each group (which.min(abs(age-11))), and use that index to subset the group's rows (.SD).

library(data.table)
setDT(df)[,.SD[which.min(abs(age-11))] , id]
#    id  age ageto11
#1:  1 10.6     0.4
#2:  2 11.0     0.0
#3:  3 11.3     0.3
#4:  4 10.9     0.1
#5:  5 10.7     0.3
#6:  6 11.0     0.0
#7:  7 11.1     0.1
#8:  8 10.3     0.7
#9:  9 10.7     0.3

EDIT: @Pascal pointed out that the distance is already calculated in 'ageto11'. In that case:

setDT(df)[, .SD[which.min(ageto11)], id]
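For comparison, the same selection can be sketched in base R without data.table. This is not part of the answer above, only a minimal alternative under the assumption that a plain data.frame result is wanted; the names 'sorted' and 'result' are hypothetical. It sorts by id and distance, then keeps the first row of each id. As with which.min, if two rows of an id are equally close to 11, the one that appears first is kept.

```r
# Same example data as in the question
df <- data.frame(
  id  = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8, 9),
  age = c(10, 10.6, 11, 11.3, 10.9, 11.4, 10.7, 11, 10.5, 11.1, 12.3, 10.3, 10.7)
)
df$ageto11 <- abs(df$age - 11)

# Sort by id, then by distance to 11, so the closest row comes first within each id
sorted <- df[order(df$id, df$ageto11), ]

# duplicated() flags every row after the first one for each id; drop those rows
result <- sorted[!duplicated(sorted$id), ]
result
```

Because the original row names are preserved, the rows kept here can be checked directly against the desired result in the question.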
