I have trouble understanding how dplyr::case_when works. Take this pretty simple call:

library(tidyverse)
case_when(TRUE ~ 50,
          FALSE ~ numeric(0))

I get numeric(0), although TRUE is obviously TRUE, so it should return 50. Besides, FALSE is FALSE, so it should never return numeric(0). I don't have this problem if I write:

case_when(TRUE ~ 50,
          FALSE ~ NaN)

Here I get 50, which is right. What am I missing?

Cettt
Malta
  • I think the problem is that numeric(0) returns a vector of length 0. If you try numeric(1) (which is a vector of length 1 with a value of 0) then it works. case_when should be reporting an error I would say, but it's not. – bischrob Jan 28 '21 at 17:35
  • For me this is unwanted behaviour and I wasn't aware of it. Maybe you can notify the `dplyr` team on GitHub. Generally, every outcome of `case_when` should have the same type and the same length. For example, `case_when(TRUE ~ 1:3, FALSE ~ 1:2)` throws an error. – Cettt Jan 28 '21 at 18:02
  • Huh, on rereading the question, I was assuming (and mis-reading) that the first code block failed. It should, in my mind. I'm with @Cettt, this is unwanted behavior. – r2evans Jan 28 '21 at 18:08
  • Apparently the [dplyr team sees this as a feature?](https://github.com/tidyverse/dplyr/issues/4852) – Gregor Thomas Jan 28 '21 at 18:10
  • It is complicated though. My immediate reaction is that I don't want `case_when` evaluating things it doesn't need to. I'd forego length checking for efficiency. `case_when(TRUE ~ 1, FALSE ~ {Sys.sleep(10); 0})` takes 10 seconds to return, but it could be instant. – Gregor Thomas Jan 28 '21 at 18:11
  • `if_else` and `case_when` are not short-circuited, @GregorThomas; while I agree that it would be a great thing, I don't think it's in the cards to make it so. :-( – r2evans Jan 28 '21 at 18:13
  • Apparently not. I had assumed that was one of the things `if_else` did to improve performance over `ifelse`, but `base::ifelse(TRUE, 1, {Sys.sleep(10); 0})` actually is short-circuited! – Gregor Thomas Jan 28 '21 at 18:15
  • I am opening a new issue, because the documentation seems murky at the very least. – Gregor Thomas Jan 28 '21 at 18:16
  • @GregorThomas, I disagree about optimizing out length-checking: R recycling, as long as it's been around, has led to so many bugs when not recognized. When recycling is not desired but it *just happens to be* that the one vector length is a multiple of the other, recycling happens and likely corrupts the data. In my head, recycling should be length-same or length-1, nothing else unless explicitly allowed. Unlikely to change in base R, unfortunately. But `dplyr` makes an intentional effort on things similar to this (enforcing `class`, e.g., when `ifelse` does not), so I'm surprised about this. – r2evans Jan 28 '21 at 18:17
  • I agree with you 100% on recycling - I love data.table's approach there as well. But this seems more restrictive. Why does this throw warnings? `x <- 1:-1; case_when(x > 0 ~ log(x), TRUE ~ as.numeric(x))`. – Gregor Thomas Jan 28 '21 at 18:34
  • `fcase` warns, too ... and it does no recycling (a problem in my book), so `TRUE` would need to be `rep(TRUE,3)` here (cf. https://github.com/Rdatatable/data.table/issues/4258, still open). – r2evans Jan 28 '21 at 19:05

0 Answers