Eliminate cases based on multiple rows values

Question

I have a base with the following information:

edit: *each row is an individual that lives in a house, multiple individuals with a unique P_ID and AGE can live in the same house with the same H_ID, I'm looking for all the houses with all the individuals based on the condition that there's at least one person over 60 in that house, I hope that explains it better *

show(base)

              H_ID           P_ID        AGE        CONACT
1      10010000001   1001000000102        35          33
2      10010000001   1001000000103        12          31
3      10010000001   1001000000104         5          NA
4      10010000001   1001000000101        37          10
5      10010000002   1001000000206         5          NA
6      10010000002   1001000000205        10          NA
7      10010000002   1001000000204        18          31
8      10010000002   1001000000207         3          NA
9      10010000002   1001000000203        24          35
10     10010000002   1001000000202        43          33
11     10010000002   1001000000201        47          10
12     10010000003   1001000000302        26          33
13     10010000003   1001000000301        29          10
14     10010000004   1001000000401        56          32
15     10010000004   1001000000403        22          31
16     10010000004   1001000000402        49          10
17     10010000005   1001000000503         1          NA
18     10010000005   1001000000501        24          10
19     10010000005   1001000000502        23          10
20     10010000006   1001000000601        44          10
21     10010000007   1001000000701        69          32

I want a list with all the houses and all the individuals living there based on the condition that there's at least one person 60+, here's a link for the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing

And here's how I made the base:

hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)

base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base

Any help is much much appreciated, Thanks!

If you are keeping id basedon at least one element AGE > 60, but why there are no IDs with that group — akrun, Mar 12 '20 at 21:28
each row is an individual that lives in a house, multiple individuals with a unique P_ID and AGE can live in the same house with the same H_ID, I'm looking for all the houses with all the individuals based on the condition that there's at least one person over 60 in that house, I hope that explains it better — Raúl Valentín Fonseca, Mar 12 '20 at 21:32
there is only row that shows AGE > 60, in the expected output, none of H_ID groups have AGE > 60 — akrun, Mar 12 '20 at 21:33
It is part of a larger data set, on those first 21 cases there is only one individual over 60, but there are a lot more — Raúl Valentín Fonseca, Mar 12 '20 at 21:35
I want the houses list that include a person over 60 with all the persons that live there, including the 60+, sorry if I'm not being clear enough, I'm gonna try to put the data now on my question — Raúl Valentín Fonseca, Mar 12 '20 at 21:42
sorry for the time, the data is now up and I added how i obtain those variables — Raúl Valentín Fonseca, Mar 12 '20 at 21:58

George Savva · Accepted Answer · 2020-03-13T12:24:21.400

This can be done by:

Adding a variable with the maximum age per household

base$maxage <- ave(base$AGE, base$H_ID, FUN=max)

Then only keeping households with a maximum age above 60.

base <- subset(base, maxage >= 60)

Or you could combine the two lines into one. With the column names in your linked data:

> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
   datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21  10010000007    1001000000701         69           32
22  10010000008    1001000000803         83           33
23  10010000008    1001000000802         47           33
24  10010000008    1001000000801         47           10
36  10010000012    1001000001204          4           NA
37  10010000012    1001000001203          2           NA

This was perfect! Thank you :) – Raúl Valentín Fonseca Mar 13 '20 at 01:35 — Raúl Valentín Fonseca, Mar 13 '20 at 01:35

score 1 · Answer 2 · answered Mar 13 '20 at 01:46

1

Using dplyr, we can group_by H_ID and select houses where any AGE is greater than 60.

library(dplyr)
df %>% group_by(H_ID) %>% filter(any(AGE > 60))

Similarly with data.table

library(data.table)
setDT(df)[, .SD[any(AGE > 60)], H_ID]

answered Mar 13 '20 at 01:46

Ronak Shah

377,200
20
156
213

TTS · Answer 3 · 2020-03-12T22:22:59.780

To get a list of the houses with a tenant Age > 60 we can filter and create a list of distinct H_IDs

house_list <- base %>%
  filter(AGE > 60) %>% 
  distinct(H_ID) %>%
  pull(H_ID)

Then we can filter the original dataframe based on that house_list to remove any households that do not have someone over the age of 60.

house_df <- base %>%
  filter(H_ID %in% house_list)

To then calculate the CON values we can filter out NA values in CONACT, group_by(H_ID) and summarize to find the number of individuals within each house that have a non-NA CONACT value.

CON_calcs <- house_df %>%
  filter(!is.na(CONACT)) %>% 
  group_by(H_ID) %>%
  summarize(Count = n())

And join that back into the house_df based on H_ID to include the newly calculated CON values, and I believe that should end with your desired result.

final_df <- left_join(house_df, CON_calcs, by = 'H_ID')

I added the data for testing – Raúl Valentín Fonseca Mar 12 '20 at 22:03 — Raúl Valentín Fonseca, Mar 12 '20 at 22:03

Eliminate cases based on multiple rows values

3 Answers3