I am trying to write a function that computes the mean (and median) of a variable whose name is passed to the function as an argument. For instance:

mean_of_var <- function(var) {
  open_dataset('myfile') %>%
    summarise(
      meanB = mean(get(var), na.rm = T),
      medianB = median(get(var), na.rm = T)
    ) %>%
    collect()
}
mean_of_var('myvar')

The main problem is that a query built on arrow::open_dataset() does not support the get() function, so I get this error message:

Error: Error : Expression mean(get(var), na.rm = T) not supported in
   Arrow Call collect() first to pull data into R.

Is there a way to write a function like this while still using open_dataset('myfile')?

LucasMation

1 Answer

The dplyr verbs used in arrow rely on "tidy evaluation". You therefore need to "embrace" the argument with {{ }} inside your function, and pass the column name unquoted when you call it:

library(arrow)
library(dplyr)

## create a parquet file to read with `open_dataset()`
pq_file <- tempfile(fileext = ".parquet")
dd <- tibble::tibble(
  col1 = rnorm(100),
  col2 = rnorm(100),
  col3 = rnorm(100)
)
write_parquet(dd, sink = pq_file)

mean_of_var <- function(var) {
  open_dataset(pq_file) %>%
    summarize(
      meanB = mean({{ var }}, na.rm = TRUE),
      medianB = median({{ var }}, na.rm = TRUE)
    ) %>%
    collect()      
}

To use the function:

mean_of_var(col2)
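
The question passes the column name as a string ('myvar') rather than as a bare name. A minimal sketch of one way to handle that case, assuming rlang is installed (it is a dependency of dplyr): convert the string to a symbol with rlang::sym() and inject it with !!, so the column reference is resolved before arrow translates the expression. The names mean_of_var_chr and pq_file_demo are hypothetical and used only for illustration.

```r
library(arrow)
library(dplyr)

## demo parquet file, built the same way as in the example above
## (pq_file_demo is a hypothetical name for this sketch)
pq_file_demo <- tempfile(fileext = ".parquet")
dd <- tibble::tibble(
  col1 = rnorm(100),
  col2 = rnorm(100),
  col3 = rnorm(100)
)
write_parquet(dd, sink = pq_file_demo)

## variant that takes the column name as a character string
mean_of_var_chr <- function(var) {
  col <- rlang::sym(var)  # turn "col2" into the symbol col2
  open_dataset(pq_file_demo) %>%
    summarize(
      meanB = mean(!!col, na.rm = TRUE),     # !! injects the symbol
      medianB = median(!!col, na.rm = TRUE)
    ) %>%
    collect()
}

mean_of_var_chr("col2")
```

Because the symbol is injected when the expression is captured, arrow only ever sees a plain column reference, which avoids the unsupported get() call from the question.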
fmic_