Say I have a function called boop. Its behaviour depends on the class of its argument, so I define it as an S3 generic, like so:

library(dplyr)

df <- data.frame(a = c("these", "are", "some", "strings"),
                 b = 1:4)

boop <- function(x, ...) UseMethod("boop", x)

# Methods accept ... so their signatures match the generic
boop.numeric <- function(x, ...) mean(x, na.rm = TRUE)

boop.character <- function(x, ...) mean(nchar(x), na.rm = TRUE)

df %>% summarise(across(everything(), boop))
                                
#      a   b
# 1 4.75 2.5

Perfect! Now, say I want to use boop with a parquet file before collecting the data. I can write dplyr code similar to the summarise above, but first I need to register my functions with Arrow. For example:

library(arrow)

register_scalar_function(
  "boop.numeric",
  function(context, x) {
    mean(x, na.rm = TRUE)
  },
  in_type = schema(x = float64()),
  out_type = float64(),
  auto_convert = TRUE
)

But how do I define boop itself as a generic in the first place? If I translate my original boop directly into a registered Arrow function, I need to declare the input schema. However, unlike boop.numeric or boop.character, boop is generic, so x has no single fixed type to declare.


Question: How do I use generics, such as the one shown above, with Apache Arrow prior to collecting data?

Dan

1 Answer

I don't believe that this is currently possible.

I'm currently working on a PR that enables the use of where(), which I hope to have finished for the upcoming release (end of the month). Combined with across(), the PR for which is now merged, this would let you select the method per column type by hand as a workaround, e.g. df %>% summarise(across(where(is.numeric), boop.numeric)), and likewise for the other types.
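
As a rough illustration of that workaround, here is a minimal sketch on the in-memory data frame from the question, using plain dplyr. It calls the two methods directly instead of relying on S3 dispatch. Whether the same pipeline runs on an Arrow table or dataset before collect() is an assumption that depends on the where() support described above and on the functions being registered with Arrow; the name res is only for illustration.

```r
library(dplyr)

df <- data.frame(a = c("these", "are", "some", "strings"),
                 b = 1:4)

# The two methods from the question, called directly (no S3 dispatch)
boop.numeric <- function(x) mean(x, na.rm = TRUE)
boop.character <- function(x) mean(nchar(x), na.rm = TRUE)

# Pick the method for each column type by hand
res <- df %>%
  summarise(across(where(is.numeric), boop.numeric),
            across(where(is.character), boop.character))

res
```

The values match the generic version in the question (4.75 for a, 2.5 for b), but the columns come out grouped by type, so b appears before a.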

I've also opened a ticket on the project JIRA requesting implementation of the ability to call user-defined generic functions.

thisisnic