6

I have a data where the first column is a bunch of ID numbers (some repeat), and the second column is just a bunch of numbers. I need a way to keep each ID number only once based on the smallest number in the second column.

Row#   ID   Number
1      10     180
2      12     167
3      12     182
4      12     135
5      15     152
6      15     133

Ex: I only want to keep Row# 1, 4, and 6 here and delete the rest

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
Swanny
  • 189
  • 1
  • 12

2 Answers2

14

For selecting the row that has the minimum 'Number' for each 'ID' group, we can use one of the aggregating by group function. A base R option is aggregate. With aggregate, we can either use the 'formula' method or specify a list of grouping elements/variables with the by argument. Using the formula method, we get the min value of 'Number' for each 'ID'.

aggregate(Number~ID, df1, FUN=min)

Or we can use a faster option with data.table. Here, we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', we get the min value of "Number".

library(data.table)
setDT(df1)[, list(Number=min(Number)), by = ID] 

Or this can be also done with setorder to order the 'Number' column and use unique with by option to select the first non-duplicated 'ID' row. (from @David Arenburgs' comments)

 unique(setorder(setDT(df1), Number), by = "ID")

Or using dplyr, we group by 'ID' and get the subset rows with summarise.

library(dplyr)
df1 %>%
   group_by(ID) %>%
   summarise(Number= min(Number))

Or we can use sqldf syntax to get the subset of data.

library(sqldf)
sqldf('select ID,
        min(Number) as Number
        from df1 
        group by ID')

Update

If there are multiple columns and you want to get the row based on the minimum value of 'Number' for each 'ID', you can use which.min. Using .I will get the row index and that can be used for subsetting the rows.

setDT(df1)[df1[,  .I[which.min(Number)], by = ID]$V1]

Or with dplyr we use slice to filter out the rows that have the min value of 'Number' for each 'ID'

df1 %>% 
    group_by(ID) %>%
    slice(which.min(Number))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Or `unique(setorder(setDT(df1), Number), by = "ID")`. I also think you need `which.min` instead of `min` because it seems that OP *always* want to keep one row per `ID`. Btw, nice `sqldf` solution there :) – David Arenburg Jun 15 '15 at 18:45
  • @DavidArenburg Thanks, `min` keeps only the single value, but `which.min` will be more general when there are more columns. – akrun Jun 16 '15 at 04:14
2

Or

do.call(rbind, 
lapply(split(df1, df1$ID), function(x) subset(x, Number == min(Number))))

#ID Number
#10 10    180
#12 12    135
#15 15    133
Veerendra Gadekar
  • 4,452
  • 19
  • 24