Preserve the structure of R data frame upon modification (in tidy pipelines)

Question

I am working with data frames of a certain structure. For the sake of example, they are all expected to have an identical set of columns and unique values of the "id" column. There is also a set of S3 generics and methods for this class of data frames, like "plot", "write" and so on.

However, if the user modifies the structure of a data frame, then semantically it no longer belongs to the initial class, although the object may still formally have the class label. Hence none of the S3 methods can be safely applied to the object.

Is it possible to somehow check the validity of the data frame each time it is modified, and if the structure changes, remove the class label? What's the correct way to modify the data frame (or for that matter any object) and preserve its structure?

UPD: By modification I mean not only the base-R [ setters, but also the result of applying a tidyverse pipeline, e.g. select'ing, mutate'ing, ... -- these verbs do not modify the original object, but rather return a modified one instead, copying the original attributes.

One option I can think of is to wrap the data frame in an R6 class:

ZDf <- R6::R6Class(
    'ZDf',
    private = list(
        .Data = NULL
    ),
    public = list(
        initialize = function(z) {
            # Do checks here
            # This duplicates set(), but in general may implement completely
            # different logic
            private$.Data <- z
            invisible(private$.Data)
        },
        set = function(z) {
            # Do checks here
            private$.Data <- z
            invisible(private$.Data)
        },
        get = function() {
            private$.Data
        }
    )
)

And use it like so:

z <- ZDf$new(mtcars[1:10,])
z$get() %>%
    group_by(cyl) %>%
    summarise(mean(mpg)) %>%
    z$set()

I read that S4 has some support for S3 classes, and an S4-wrapped object looks almost like a vanilla S4 object:

> setClass('XDf', contains = 'data.frame')
> x <- new('XDf', mtcars[1:3,1:3]) 
> x
Object of class "XDf"
               mpg cyl disp
Mazda RX4     21.0   6  160
Mazda RX4 Wag 21.0   6  160
Datsun 710    22.8   4  108
> x[1,]
          mpg cyl disp
Mazda RX4  21   6  160

However, it is impossible to use it in hadleyverse pipelines:

> x %>% slice(1)
Error in `stop_vctrs()`:
! Input must be a vector, not a <XDf> object.
Run `rlang::last_error()` to see where the error occurred.

Mossa · Answer 1 · 2022-05-25T16:06:49.967

There's a different way to achieve this. To illustrate this, let us say that the constraints are, that the transformations has to retain the same columns as iris. This constraint is encoded in a check_ZDf function:

check_ZDf <- function(x, ...) {
  all(colnames(iris) %in% colnames(x))
}

And then we could use a function to check if an object is of ZDf class:

is.ZDf <- function(x) {
  inherits(x, "ZDf")
}

And let us make a function that takes any tibble/data.frame and amends the ZDf class, if it indeed satisfies it:

create_ZDf_from <- function(x, ...) {
  stopifnot("`x` is already instance of `ZDf`" = is.ZDf(x),
            check_ZDf(x))
  structure(x, class = c("ZDf", class(x)))
}

Now, we can just ensure that the constraint is met, after each mutations of our data variable. This means we have to check after assignments when accessing [, $, and of course [[.

`[[<-.ZDf` <- function(x, ...) {
  result <- NextMethod()
  stopifnot(check_ZDf(result))
  result
}

`[<-.ZDf` <- function(x,...) {
  result <- NextMethod()
  stopifnot(check_ZDf(result))
  result
}

`$<-.ZDf` <- function(x, ...) {
  result <- NextMethod()
  stopifnot(check_ZDf(result))
  result
}

Thus you can try these things:

example_df <- as_tibble(iris[sample.int(ncol(iris), 
  size = 100, replace = TRUE),])
ZDf_example_df <- create_ZDf_from(example_df)

ZDf_example_df[["Sepal.Length"]] <- NULL # doesn't work
ZDf_example_df["Sepal.Length"] <- rnorm(nrow(ZDf_example_df)) # works!
ZDf_example_df["Sepal.Length"] <- NULL # doesn't work
ZDf_example_df$Sepal.Length <- NULL    # doesn't work

ZDf_example_df$New <- rnorm(1) # works

moved things around, as `is.ZDf` should only if something has the class `is.ZDf`... I think.. — Mossa, May 25 '22 at 16:07
I like your approach, however this won't work in a typical tidyverse pipeline, because tidy verbs create a new object and then copy the old attributes, e.g. `ZDf_example_df %>% select(Species) %>% class()` yields `[1] "ZDf" "data.frame"` — Dmitry Zotikov, May 25 '22 at 17:42
thank you very much for your time, I'll update the question to clarify what I mean by "modification" — Dmitry Zotikov, May 25 '22 at 17:45
Yeah, unfortunately, there is no way to hook into an object in a better way........ — Mossa, May 25 '22 at 19:52

score 1 · Answer 2 · answered May 30 '22 at 14:38

Your problem statement is twofold: you are looking to have a strict control over the object state while still exposing the data frame to a variety of external methods directly. Your own suggestion with R6 classes is a standard OOP approach, and it's more or less the same idea suggested by @Mossa: only a handful of declared (known in advance) methods are allowed to manipulate the object state. It's technically feasible to extend all allowed tidyverse methods for the ZDf S3 class, but it seems cumbersome. I don't see any other, more elegant way of achieving this.

However, if your usage scenario allows this, you may relax the hard guarantee of the object state. Instead of controlling each operation (thus, once again, requiring to have a full list of allowed methods), you may only check/enforce the desired semantics where it is absolutely necessary.

is.ZDf <- function(x) {
    inherits(x, "ZDf")
}
check_ZDf <- function(x) {
    "mpg" %in% names(x) && !any(is.na(x$mpg))
}
init_ZDf <- function(x) {
    if (is.ZDf(x)) return(x)
    if (!check_ZDf(x)) stop("ZDf condition doesn't hold!")
    structure(x, class = c("ZDf", class(x)))
}

print.ZDf <- function(x) {
    if (!check_ZDf(x)) {
        if (!("mpg" %in% names(x))) {
            stop("This is not a valid ZDf!")
        }
        warning("Recovering ZDf...")
        x$mpg[is.na(x$mpg)] <- mean(x$mpg, na.rm = TRUE)
    }
    NextMethod("print")
}

In this example, the object is being checked for validity in the S3 print.

init_ZDf(iris) # fails
zdf <- init_ZDf(head(mtcars))
print(zdf) # succeeds
zdf %>% 
    mutate(mpg = ifelse(mpg > 20, NA, mpg)) %>% 
    print # recovers and succeeds
zdf %>% 
    select(-mpg) %>% 
    print # fails

Of course, the biggest downside of this approach is that we do not know when the desired object state was violated. In some cases, this is justified; in other, this may be considered a bad call, especially if users of the ZDf class are not aware of the required semantic properties or may violate them easily or unintentionally.

Preserve the structure of R data frame upon modification (in tidy pipelines)

2 Answers2