Say I want to summarise a column in an arrow table prior to collecting (because the actual dataset is larger than memory). I could do something like this:
library(arrow)
library(dplyr)

arrow_table(mtcars) %>%
  summarise(mean(mpg)) %>%
  collect()
# A tibble: 1 × 1
#   `mean(mpg)`
#         <dbl>
# 1        20.1
Now, say I want to do this programmatically and the column name is provided as a string. In regular (i.e., non-arrow) dplyr, I could use across and all_of like this:
foo_regular <- function(x) {
  mtcars %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}
foo_regular("mpg")
#        mpg
# 1 20.09062
But how do I do this in arrow?
foo_arrow <- function(x) {
  arrow_table(mtcars) %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}
foo_arrow("mpg")
# Warning: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) > :
# Expression across(all_of(x), mean) is not an aggregate expression or is not supported in Arrow; pulling data into R
# Error:
# ! Problem while computing `..1 = across(all_of(x), mean)`.
# Caused by error in `across()`:
# ! Can't subset columns that don't exist.
# ✖ Column `mpg` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.
Clearly, arrow can compute the mean of that column before collect(), since my first code chunk does exactly that. But how do I specify the column name as a string? As I say, the actual dataset is massive, so pulling the data into R first isn't an option.
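One possible workaround, offered as a sketch rather than a confirmed solution: since summarise(mean(mpg)) works on the arrow table, the string can be converted to a bare column symbol with rlang's sym() and injected with !!, so that arrow is handed the same kind of expression as in the first chunk. The function name foo_arrow_sym is hypothetical, and this assumes the installed arrow version captures the summarise() arguments with tidy evaluation. Newer arrow releases may also handle across() directly, which would be worth checking against the installed version.

```r
library(arrow)
library(dplyr)
library(rlang)  # provides sym()

# foo_arrow_sym is a hypothetical name used only for illustration.
foo_arrow_sym <- function(x) {
  arrow_table(mtcars) %>%
    # sym(x) turns the string into a symbol, and !! injects it while the
    # expression is being captured, so arrow sees an ordinary mean(<column>) call.
    summarise(mean(!!sym(x))) %>%
    collect()
}

res <- foo_arrow_sym("mpg")
res
```

If this runs without the "pulling data into R" warning, the aggregation was evaluated by arrow before collect(), which is the behaviour needed for a larger-than-memory dataset.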