Consider the following dataset:

df_test_1 <- 
    data.frame(time = c(seq(20, 40, by = 5), NA))

Find all rows where time is greater than zero:

log_vec <- df_test_1$time > 0

such that:

> log_vec
[1] TRUE TRUE TRUE TRUE TRUE   NA

Consider filtering the original dataset on this condition in base R:

> df_test_1[log_vec, , drop = FALSE]
   time
1    20
2    25
3    30
4    35
5    40
NA   NA

and the dplyr version:

> df_test_1 %>% filter(log_vec)
  time
1   20
2   25
3   30
4   35
5   40

Notice that the row with NA is returned by base R but not by dplyr. Why does this happen, and is this behaviour always to be expected? I cannot find it documented in the help file ?filter. (Note: this has previously been observed in the question "Use group_by to filter specific cases while keeping NAs".)

  • A similar question regarding `filter` was asked some time back. If I remember correctly, `filter` is designed to remove the NAs. If you want to keep the NA rows, you can use `|`, i.e. `df_test_1 %>% filter(log_vec | is.na(log_vec))` – akrun Jun 30 '16 at 03:20
  • Right, thanks. I am just wondering why it is not documented anywhere. – Alex Jun 30 '16 at 03:28
  • For some functions (from dplyr/tidyr), I have seen the behaviour change with each new release, i.e. what worked before may not work now and vice versa. – akrun Jun 30 '16 at 03:30
  • that is most unfortunate :( – Alex Jun 30 '16 at 03:33
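
The workaround from akrun's comment, together with its base R counterpart, can be sketched as follows. This is a minimal illustration rather than an authoritative answer: it assumes dplyr is installed, and the names `base_drop_na` and `dplyr_keep_na` are made up for the example. In base R, indexing with a logical `NA` yields an all-`NA` row, whereas `filter()` keeps only rows where the condition is `TRUE`. To my knowledge, more recent dplyr releases do mention in `?filter` that rows where the condition evaluates to `NA` are dropped.

```r
library(dplyr)

df_test_1 <- data.frame(time = c(seq(20, 40, by = 5), NA))
log_vec <- df_test_1$time > 0

# Base R: which() returns only the positions that are TRUE,
# so the NA row is dropped, matching what filter() does
base_drop_na <- df_test_1[which(log_vec), , drop = FALSE]

# dplyr: add is.na() explicitly to keep the rows where the condition is NA,
# matching what plain logical indexing does in base R
dplyr_keep_na <- df_test_1 %>% filter(log_vec | is.na(log_vec))

nrow(base_drop_na)   # 5
nrow(dplyr_keep_na)  # 6
```

Either form makes the handling of `NA` explicit, so the result does not depend on which subsetting method is used.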
