dplyr: How to filter groups by subgroup criteria

Question

My question is similar to this one, but the filter criterion is different.

> demo(dadmom,package="tidyr")

> library(tidyr)
> library(dplyr)

> dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")

> dadmom %>%
+   gather(key, value, named:incm) %>%
+   separate(key, c("variable", "type"), -2) %>%
+   spread(variable, value, convert = TRUE)
  famid type   inc name
1     1    d 30000 Bill
2     1    m 15000 Bess
3     2    d 22000  Art
4     2    m 18000  Amy
5     3    d 25000 Paul
6     3    m 50000  Pat

It is easy to pick out the family with mom's income >20000 using "incm" from the original table:

> dadmom
  famid named  incd namem  incm
1     1  Bill 30000  Bess 15000
2     2   Art 22000   Amy 18000
3     3  Paul 25000   Pat 50000

The question is: how do you do it from the "tidied" data?

`dplyr::filter(data, type == "m" & inc > 20000)`? — Rich Scriven, Mar 24 '15 at 00:52
That returns only one row. I want the whole group (m & d). — Dong, Mar 24 '15 at 02:08

akrun · Accepted Answer · 2015-03-24T04:58:54.313

You could add group_by and filter to the code:

#OP's code
# Note: with tidyr >= 0.8.2, use -1 instead of -2 to split off the last character
d1 <- dadmom %>%
    gather(key, value, named:incm) %>%
    separate(key, c("variable", "type"), -2) %>%
    spread(variable, value, convert = TRUE)

d1 %>%
    group_by(famid) %>%
    # keep a family only if every one of its 'm' rows has inc > 15000
    filter(sum(type == 'm' & inc > 15000) == sum(type == 'm'))

#    famid type   inc name
# 1     2    d 22000  Art
# 2     2    m 18000  Amy
# 3     3    d 25000 Paul
# 4     3    m 50000  Pat

NOTE: The above also works when there are multiple 'm' rows per famid (a bit more general): a family is kept only if every one of its 'm' rows passes the test.

For the normal case of a single 'd'/'m' pair per famid:

 d1 %>%
     group_by(famid) %>% 
     filter(any(inc >15000 & type=='m'))
 #   famid type   inc name
 #1     2    d 22000  Art
 #2     2    m 18000  Amy
 #3     3    d 25000 Paul
 #4     3    m 50000  Pat

Also, if you wish to use data.table, `melt` from the devel version (v1.9.5) can take multiple value columns. It can be installed from here.

 library(data.table)
 melt(setDT(dadmom), measure.vars=list(c(2,4), c(3,5)), 
    variable.name='type', value.name=c('name', 'inc'))[,
    type:=c('d', 'm')[type]][, .SD[any(type=='m' & inc >15000)] ,famid]
 #    famid type name   inc
 #1:     2    d  Art 22000
 #2:     2    m  Amy 18000
 #3:     3    d Paul 25000
 #4:     3    m  Pat 50000

Thanks. I was not aware of "any" and "all". Are they efficient functions? I am asking because my actual data would have many groups, and many rows within the groups (though very regular, with a fixed number and sequence). — Dong, Mar 24 '15 at 15:05
It is not big now. ~(7x200k) --> (7x400k) after tidying. For now it is ~100k groups -- like the ones from `group_by(famid)`, and 4 rows per group. But both the width and length of the data will grow as I gather more experimental data. — Dong, Mar 24 '15 at 18:32
@Dong Just a thought: if it gets big, wouldn't it be better to subset before `gather`? (I haven't benchmarked it, though.) — akrun, Mar 25 '15 at 02:45
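
A rough sketch of the "subset first" idea from the last comment, assuming the criterion can be written against a single wide column (here incm). The helper to_long() is hypothetical: it is a plain dplyr stand-in for the gather/separate/spread pipeline, used only so the example does not depend on a particular tidyr version. The data frame is typed in by hand from the table in the question. No timings are claimed here.

```r
library(dplyr)

# Wide data as printed in the question
dadmom <- data.frame(
  famid = 1:3,
  named = c("Bill", "Art", "Paul"),
  incd  = c(30000, 22000, 25000),
  namem = c("Bess", "Amy", "Pat"),
  incm  = c(15000, 18000, 50000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wide -> long, one row per (famid, type)
to_long <- function(wide) {
  bind_rows(
    transmute(wide, famid, type = "d", inc = incd, name = named),
    transmute(wide, famid, type = "m", inc = incm, name = namem)
  ) %>%
    arrange(famid, type)
}

# Subset on the wide column first, then reshape only the surviving families
early <- dadmom %>%
  filter(incm > 15000) %>%
  to_long()

# Reshape everything first, then filter by group as in the answer
late <- dadmom %>%
  to_long() %>%
  group_by(famid) %>%
  filter(any(type == "m" & inc > 15000)) %>%
  ungroup()

early
```

Both routes select the same families; the first one reshapes fewer rows and needs no grouped filter. It only applies when each family has a single wide row, so the group-level test reduces to a row-level one.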
