I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code:

start_numeric_cols = "sum"
sales <- sales %>% mutate(
  across(starts_with(start_numeric_cols) & (!where(is.numeric)), 
         \(col) {replace(col, col == "NULL", 0) %>% as.numeric()}),
  across(starts_with(start_numeric_cols) & (where(is.numeric)),
         \(col) {replace(col, is.na(col), 0)})
)
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow

The error message is pretty informative, but I am wondering whether there is any way to do the same only with dplyr verbs within across (or another workaround without having to type each column name).

1 Answer

arrow has a growing set of functions that can be used without pulling the data into R (available here) but replace() is not yet supported. However, you can use ifelse()/if_else()/case_when(). Note also that purrr-style lambda functions are supported where regular anonymous functions are not.

I don't have your data so will use the iris dataset as an example to demonstrate that the query builds successfully, even if it doesn't make complete sense in the context of this data.

library(arrow)
library(dplyr)

start_numeric_cols <- "P"

iris %>%
  as_arrow_table() %>%
  mutate(
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", 0, .x))
    ),
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  )

Table (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double (if_else(is_null(Petal.Length, {nan_is_null=true}), 0, Petal.Length))
Petal.Width: double (if_else(is_null(Petal.Width, {nan_is_null=true}), 0, Petal.Width))
Species: dictionary<values=string, indices=int8>

See $.data for the source Arrow object
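
Since iris has no non-numeric columns starting with "P", the first across() call above never actually runs. Below is a minimal sketch on made-up data that exercises both branches and evaluates the query with collect(). The table and its column names (sum_a, sum_b, region) are hypothetical stand-ins for the asker's data, and I use the string "0" as the replacement so both branches of if_else() have the same type before the cast; that is my own choice here, not something the original data requires.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data (not the asker's dataset):
# sum_a is a string column that uses "NULL" as a placeholder,
# sum_b is a numeric column with a missing value.
sales <- arrow_table(
  sum_a = c("1.5", "NULL", "3"),
  sum_b = c(10, NA, 30),
  region = c("x", "y", "z")
)

result <- sales %>%
  mutate(
    # Non-numeric "sum" columns: swap the "NULL" placeholder for "0",
    # then cast the whole column to double.
    across(
      starts_with("sum") & !where(is.numeric),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric "sum" columns: replace missing values with 0.
    across(
      starts_with("sum") & where(is.numeric),
      ~ if_else(is.na(.x), 0, .x)
    )
  ) %>%
  collect() # evaluation happens here, so cast errors would surface at this point

print(result)
```

Building the query only checks that the expressions can be translated; casting problems in the data only show up once collect() or compute() is called, so it is worth running one of them on a small sample first.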
Ritchie Sacramento
    That worked great! For other users, you can also use the pipe within the `across` `.fns` argument in `arrow`, which I first left as `~ifelse(.x == "NULL", 0, .x) %>% as.numeric()`. Although for this particular case I think the output is identical. Thanks also for the link to implemented functions in arrow, it is really useful. – Alberto Agudo Dominguez Apr 10 '23 at 11:31
    Just a small comment, when doing what was suggested in the first edit (i.e.: `~ifelse(.x == "NULL", 0, as.numeric(.x))`) I did not get any errors until calling `compute`, which returned `Error in compute(): ! Invalid: Failed to parse string: 'NULL' as a scalar of type double`. This last alternative of wrapping the `if_else` statement with `as.numeric` does work. – Alberto Agudo Dominguez Apr 10 '23 at 11:41
    @AlbertoAgudoDominguez - that's good to know. I was unable to test but realized that doing the coercion inside the ifelse statement could cause issues (as well as being bad practice). – Ritchie Sacramento Apr 10 '23 at 11:44