How to drop factors that have fewer than n members

Question

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?

Data:

library(data.table)

DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
                id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

Goal: remove every row that belongs to a group with fewer than 5 rows. The grouping variable is "id". In DT, groups "1" and "4" have fewer than 5 members (4 and 3 rows respectively), so their rows should be dropped. Desired result:

    x y v id
 1: a 3 5  2
 2: a 6 6  2
 3: b 1 7  2
 4: b 3 8  2
 5: b 6 9  2
 6: b 1 1  3
 7: b 3 2  3
 8: b 6 3  3
 9: c 1 4  3
10: c 3 5  3
11: c 6 6  3

Here's one approach.

Count the rows in each id group, then flag the groups to keep:

nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5

Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?

idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]

The grouping variable is "id", and groups with fewer than 5 rows should be deleted. In the example, groups "1" and "4" are removed. (comment, Nov 20 '14 at 20:42)

Rich Scriven · Accepted Answer · 2014-11-20T23:45:17.837

Since you begin with a data.table, this first part uses data.table syntax.

EDIT: Thanks to Arun (comment) for helping me improve this data table answer

DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
#     x y v id
#  1: a 3 5  2
#  2: a 6 6  2
#  3: b 1 7  2
#  4: b 3 8  2
#  5: b 6 9  2
#  6: b 1 1  3
#  7: b 3 2  3
#  8: b 6 3  3
#  9: c 1 4  3
# 10: c 3 5  3
# 11: c 6 6  3
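
To see why the one-liner works, it may help to split it into two steps. This is only a sketch, assuming the DT from the question; the names `keep` and `res` are illustrative and not part of the original answer.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

# Step 1: within each id group, .I holds that group's row numbers in DT
# and .N is the group size. Indexing .I with a single TRUE keeps all of
# them; a single FALSE keeps none, so small groups contribute no rows.
keep <- DT[, .(I = .I[.N >= 5L]), by = id]

# Step 2: subset DT by the collected row numbers.
res <- DT[keep$I]
res
```

Because the final subset is by row number, the column order of DT is left unchanged.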

In base R you could use

df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
#    x y v id
# 5  a 3 5  2
# 6  a 6 6  2
# 7  b 1 7  2
# 8  b 3 8  2
# 9  b 6 9  2
# 10 b 1 1  3
# 11 b 3 2  3
# 12 b 6 3  3
# 13 c 1 4  3
# 14 c 3 5  3
# 15 c 6 6  3

edited Nov 20 '14 at 23:45

answered Nov 20 '14 at 21:10

No keys necessary for aggregations. `DT[, .SD[.N >=5L], by=id]` would be the ideal way. But until it's optimised for speed, `DT[DT[, .(I=.I[.N>=5L]), by=id]$I]` – Arun Nov 20 '14 at 22:30
If I remember old discussions, `DT[,if(.N >= 5) .SD,by=id]` is also slightly quicker if the `.SD` syntax is used. – thelatemail Nov 21 '14 at 01:38
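
The two `.SD` variants from the comments can be checked against the `.I` version. A minimal sketch, assuming the DT from the question; the object names `via_I`, `via_SD` and `via_if` are made up for illustration, and no timing claims are tested here.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

via_I  <- DT[DT[, .(I = .I[.N >= 5L]), by = id]$I]  # row-number approach
via_SD <- DT[, .SD[.N >= 5L], by = id]              # subset .SD per group
via_if <- DT[, if (.N >= 5L) .SD, by = id]          # return .SD or NULL per group

# Grouping with by = id puts id first in the result, so the .SD variants
# differ from via_I in column order only. Restore the original order:
setcolorder(via_SD, names(DT))
setcolorder(via_if, names(DT))
```

All three keep the same rows; the only visible difference before `setcolorder()` is that `id` comes first in the grouped results.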

score 2 · Answer 2 · answered Nov 20 '14 at 21:40

If using a data.table is not necessary, you can use dplyr:

library(dplyr)

data.frame(DT) %>%
  group_by(id) %>%
  filter(n() >= 5)
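
Note that the result of the pipeline above is still grouped by id. If that is not wanted, `ungroup()` can be added at the end. A sketch, assuming the same data as in the question, built here as a plain data frame; `df` and `res` are illustrative names.

```r
library(dplyr)

df <- data.frame(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

res <- df %>%
  group_by(id) %>%
  filter(n() >= 5) %>%  # n() is the size of the current group
  ungroup()             # drop the grouping so later verbs act on all rows
res
```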

answered Nov 20 '14 at 21:40

davechilders

