Summarise before collecting in arrow using strings for column names

Question

Say I want to summarise a column in an arrow table prior to collecting (because the actual dataset is larger than memory). I could do something like this:

arrow_table(mtcars) %>% 
  summarise(mean(mpg)) %>% 
  collect()

# A tibble: 1 × 1
#     `mean(mpg)`
#           <dbl>
#   1        20.1

Now, say I want to do this programmatically and the column name is provided as a string. In regular (i.e., non-arrow) dplyr, I could use across and all_of like this:

foo_regular <- function(x){
  mtcars %>% 
    summarise(across(all_of(x), mean)) %>% 
    collect()
}

foo_regular("mpg")

#        mpg
# 1 20.09062

But how do I do this in arrow?

foo_arrow <- function(x){
  arrow_table(mtcars) %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}

foo_arrow("mpg")

# Warning: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >  : 
# Expression across(all_of(x), mean) is not an aggregate expression or is not supported in Arrow; pulling data into R
# Error:
#   ! Problem while computing `..1 = across(all_of(x), mean)`.
# Caused by error in `across()`:
#   ! Can't subset columns that don't exist.
# ✖ Column `mpg` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

Clearly, computing the mean of that column before collect() is possible in arrow, since my first code chunk does exactly that, but how do I specify column names as strings? As noted above, the actual dataset is massive, so pulling the data into R first isn't an option.

thisisnic · Accepted Answer · 2022-11-07T12:19:51.530

3

[Edited to add: the advice below is no longer necessary; version 10.0.0 of arrow, which supports across(), has now been released.]

In the most recent released version of Arrow (9.0.0.1), across() is not yet implemented, but it has been implemented in the most recent development version, and so should be in the upcoming release (10.0.0).

For the moment, you can either install a nightly version of arrow via arrow::install_arrow(nightly = TRUE), which will successfully run your code example, or manually specify the columns/functions to summarise() without using across().
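
As a rough sketch of the second option (manually specifying the column without across()), one approach that should work is to turn the string into a symbol with rlang::sym() and inject it with !!, so that arrow receives a plain mean(mpg) call. The function name foo_arrow_manual is hypothetical, and this assumes arrow's summarise() resolves injected symbols the same way dplyr does:

```r
library(arrow)
library(dplyr)

# Hypothetical helper: `x` is a column name supplied as a string.
foo_arrow_manual <- function(x) {
  arrow_table(mtcars) %>%
    # rlang::sym() converts the string to a symbol and !! injects it,
    # so the expression Arrow sees is mean(mpg), not across(...).
    # `!!x :=` names the output column after the input column.
    summarise(!!x := mean(!!rlang::sym(x))) %>%
    collect()
}

result <- foo_arrow_manual("mpg")
print(result)
```

Because the injection happens when the expression is captured, the aggregation should still run in Arrow before collect(), which is the point when the data is larger than memory.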

edited Nov 07 '22 at 12:19

answered Sep 29 '22 at 14:36

thisisnic


1

I see you contribute to arrow/R, thanks! Do you know of an expected timeline for the release of 10? I know it depends on many things out of your control, but are you expecting days, weeks, or months? – r2evans Sep 29 '22 at 16:43
1

We do quarterly releases. The next major release of the main project is scheduled for roughly 1 month from now, and then we send the R package to CRAN. – thisisnic Sep 29 '22 at 21:56
