Missing data when using reverse function for filtering based on number of NA

Question

I am looking for help retaining incomplete data when filtering on the number of NA values. I am conducting a 25-week quasi-experiment where an intervention occurs at Week 13. For my primary analysis I am only including participants with at least 3 weeks of measurements in each of the pre- and post-intervention periods. I was able to obtain my analytic sample using code from this link: Filter based on NA in dplyr

However, I can't obtain the correct number of participants with incomplete data to retain the sample size in the original dataset when combined with those with complete data. As an example, when I apply the filter I obtain 2/3 of participants but when I use the reverse function (i.e., removing ! from is.na) I do not get the other 1/3. Here is the code I used to obtain my analytic sample, followed by the code I am trying to use to obtain participants with incomplete data:

BCData6 <- BCData5 %>%
  group_by(user_id) %>%
  filter(sum(!is.na(Average.Steps)[Intervention == 0]) >= 3) %>%
  filter(sum(!is.na(Average.Steps)[Intervention == 1]) >= 3)

NLData7 <- NLData5 %>%
  group_by(user_id) %>%
  filter(sum(is.na(Average.Steps)[Intervention == 0]) >= 3) %>%
  filter(sum(is.na(Average.Steps)[Intervention == 1]) >= 3)

Applying the first filter (with !) leaves 348,075 observations out of the original 548,200. Removing ! yields a dataset with 182,450 observations. Together these sum to 530,525, which is 17,675 short of the original sample size.

Any guidance would be greatly appreciated!

EDIT

> dput(NLData6[1:25,c(9,10)])
structure(list(Week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25), Average.Steps = c(2124, 
3115, 2325, 2586, 4273, 3981, 5716, 4724, 3948, 1531, 1539, 4166, 
2016, 2453, 1700, 1903, 1546, 2139, 1765, 1608, 2416, 2254, 2136, 
1827, 1906)), row.names = c(NA, -25L), class = c("tbl_df", "tbl", 
"data.frame"))

Please forgive my naivety; I'm still figuring out RStudio itself along with the customs of Stack Overflow and Cross Validated.

[Image: data output of NLData6]

@akrun is my edit what you were referring to? I appreciate any help to accelerate the learning curve — Sean Spilsbury, Aug 07 '21 at 21:37
yes, that is what I meant. thanks. Also, if you can please show the expected for that data you showed as input. — akrun, Aug 07 '21 at 21:40
Sorry, I couldn't figure out a way to use a tibble that was legible so I resorted to a picture instead — Sean Spilsbury, Aug 07 '21 at 21:56

score 0 · Answer 1 · answered Aug 08 '21 at 00:04

In your question you wrote: "As an example, when I apply the filter I obtain 2/3 of participants but when I use the reverse function (i.e., removing ! from is.na) I do not get the other 1/3." So you changed

BCData5 %>%
  group_by(user_id) %>%
  filter(sum(!is.na(Average.Steps)[Intervention==0])>=3)

into

BCData5 %>%
  group_by(user_id) %>%
  filter(sum(is.na(Average.Steps)[Intervention==0])>=3)

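(For simplicity I show only one filter call.) Removing ! doesn't give you the remaining 1/3 of your participants; it gives you the participants with sum(is.na(...)) >= 3. So you are still missing the ones with both sum(!is.na(...)) < 3 and sum(is.na(...)) < 3. The true complement of the original filter is sum(!is.na(...)) < 3: keep the ! and flip the comparison instead. With both filters applied, the complement is every participant who fails at least one of the two conditions.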
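A minimal sketch of that idea, assuming dplyr is installed. The `toy` data frame below is made up for illustration (3 participants, 6 weeks, intervention from week 4) and is not the asker's data; only the column names are taken from the question. It compares the original filter, the "remove !" version, and two ways of getting the actual complement.

```r
library(dplyr)

# Hypothetical toy data: 3 participants, 6 weeks each, intervention from week 4
toy <- tibble(
  user_id       = rep(1:3, each = 6),
  Week          = rep(1:6, times = 3),
  Intervention  = rep(c(0, 0, 0, 1, 1, 1), times = 3),
  Average.Steps = c(
    100, 200, 300, 400, 500, 600,  # user 1: 3 pre, 3 post measurements
    100,  NA, 300, 400, 500, 600,  # user 2: only 2 pre measurements
     NA,  NA,  NA,  NA,  NA,  NA   # user 3: everything missing
  )
)

# Analytic sample: at least 3 non-missing weeks in each period
complete <- toy %>%
  group_by(user_id) %>%
  filter(sum(!is.na(Average.Steps)[Intervention == 0]) >= 3,
         sum(!is.na(Average.Steps)[Intervention == 1]) >= 3) %>%
  ungroup()

# "Removing !": at least 3 MISSING weeks in each period (not the complement)
reversed <- toy %>%
  group_by(user_id) %>%
  filter(sum(is.na(Average.Steps)[Intervention == 0]) >= 3,
         sum(is.na(Average.Steps)[Intervention == 1]) >= 3) %>%
  ungroup()

# Actual complement: negate the whole combined condition
incomplete <- toy %>%
  group_by(user_id) %>%
  filter(!(sum(!is.na(Average.Steps)[Intervention == 0]) >= 3 &
           sum(!is.na(Average.Steps)[Intervention == 1]) >= 3)) %>%
  ungroup()

# Alternative: drop every participant already in the analytic sample
incomplete_alt <- anti_join(toy, complete, by = "user_id")

nrow(complete) + nrow(reversed)    # 12: user 2 is in neither set
nrow(complete) + nrow(incomplete)  # 18: matches nrow(toy)
```

In this toy example user 2 has two pre-intervention measurements and one NA, so they fail both the ">= 3 non-missing" and the ">= 3 missing" filters. That is the kind of participant who would account for a shortfall like the one in the question. The anti_join version may be easier to maintain, since the condition is written only once.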
