R - How to extract object names from expression

Question

Given an rlang expression:

expr1 <- rlang::expr({
  d <- a + b
})

How to retrieve the names of the objects referred to within the expression?

> extractObjects(expr1)
[1] "d" "a" "b"

Better yet, how to retrieve the object names and categorise them as "required" (input) and "created" (output)?

> extractObjects(expr1)
$created
[1] "d"

$required
[1] "a" "b"

Konrad Rudolph · Accepted Answer · 2020-09-09T12:14:49.413

The base function all.vars does this:

〉all.vars(expr1)
[1] "d" "a" "b"

Alternatively, you can use all.names to get all names in the expression rather than just those that aren’t used as calls or operators:

〉all.names(expr1)
[1] "{"  "<-" "d"  "+"  "a"  "b"

Don’t be misled: this result is correct! All of these appear in the expression, not just a, b and d.

But it may not be what you want.

In fact, I’m assuming what you want corresponds to the leaf tokens in the abstract syntax tree (AST) — in other words, everything except function calls (and operators, which are also function calls).
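As a small illustration of the point that operators are ordinary function calls, here is a sketch using base quote() instead of rlang::expr, so that it has no dependencies (the variable name e is only for illustration):

```r
e = quote(a + b)

is.call(e)                         # TRUE: the operator expression is a call
as.character(e[[1L]])              # "+": the function being called
identical(e, quote(`+`(a, b)))     # TRUE: infix and prefix forms parse to the same call
all.names(e)                       # "+" "a" "b"
all.names(e, functions = FALSE)    # "a" "b"
```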

The syntax tree for your expression looks as follows:¹

   {
   |
   <-
   /\
  d  +
    / \
   a   b

Getting this information means walking the AST:

leaf_nodes = function (expr) {
    if(is.call(expr)) {
        unlist(lapply(as.list(expr)[-1L], leaf_nodes))
    } else {
        as.character(expr)
    }
}

〉leaf_nodes(expr1)
[1] "d" "a" "b"

Thanks to the AST representation we can also find inputs and outputs:

is_assignment = function (expr) {
    is.call(expr) && as.character(expr[[1L]]) %in% c('=', '<-', '<<-', 'assign')
}

vars_in_assign = function (expr) {
    if (is.call(expr) && identical(expr[[1L]], quote(`{`))) {
        vars_in_assign(expr[[2L]])
    } else if (is_assignment(expr)) {
        list(created = all.vars(expr[[2L]]), required = all.vars(expr[[3L]]))
    } else {
        stop('Expression is not an assignment')
    }
}

〉vars_in_assign(expr1)
$created
[1] "d"

$required
[1] "a" "b"

Note that this function does not handle complex assignments (i.e. stuff like d[x] <- a + b or f(d) <- a + b) very well.
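To see why complex assignment targets are a problem: all.vars on a target such as d[x] returns every variable in it, so an index like x would be reported as "created". One possible way around this, shown here only as a sketch, is to descend through the first argument of the target until a plain name remains. The helper assign_target is hypothetical and not part of the code above:

```r
# Hypothetical helper: find the variable that a (possibly complex)
# assignment target modifies by following the first argument of each call.
assign_target = function (lhs) {
    while (is.call(lhs)) lhs = lhs[[2L]]
    as.character(lhs)
}

all.vars(quote(d[x]))               # "d" "x": x would wrongly count as created
assign_target(quote(d[x]))          # "d"
assign_target(quote(f(d)))          # "d"
assign_target(quote(names(d)[2]))   # "d"
```

This assumes the modified object is always the first argument of the target call, which holds for the usual replacement forms (d[x], d$x, names(d)) but is not checked here. The variables in the index, such as x, would still need to be added to the "required" set separately.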

¹ lobstr::ast shows the syntax tree differently, namely as

█─`{`
└─█─`<-`
  ├─d
  └─█─`+`
    ├─a
    └─b

… but the above representation is more conventional outside R, and I find it more intuitive.

Thanks Konrad ! The `all.names` function was what I was searching for. It actually has this "functions" argument, so you can remove them: `> all.names(expr = expr1, functions = FALSE) [1] "d" "a" "b"` This did the trick, even though the categorisation into input / output objects would be an interesting functionality. — StephGC, Sep 09 '20 at 11:47
@StephGC That’s crazy, I hadn’t read the documentation in so long that I forgot about this usage. Note that there’s also `all.vars(expr)`, which does the same as `all.names(expr, functions = FALSE, unique = TRUE)`. — Konrad Rudolph, Sep 09 '20 at 11:59
@StephGC OK, check the update to my answer. It now also contains a rudimentary implementation that separates inputs and outputs. — Konrad Rudolph, Sep 09 '20 at 12:15

Artem Sokolov · Answer 2 · 2020-09-03T19:15:14.713

Another solution is to extract the abstract syntax tree explicitly:

getAST <- function(ee) purrr::map_if(as.list(ee), is.call, getAST)

str(getAST(expr1))
#  List of 2
#   $ : symbol {
#   $ :List of 3
#    ..$ : symbol <-
#    ..$ : symbol d
#    ..$ :List of 3
#    .. ..$ : symbol +
#    .. ..$ : symbol a
#    .. ..$ : symbol b

Then traverse the AST to find the assignment(s):

extractObjects <- function(ast)
{
    ## Ensure that there is at least one node
    if( length(ast) == 0 ) stop("Provide an AST")

    ## If we are working with the assignment
    if( identical(ast[[1]], as.name("<-")) ) {
        ## Separate the LHS and RHS
        list(created = as.character(ast[[2]]),
             required = sapply(unlist(ast[[3]]), as.character))
    } else {
        ## Otherwise recurse to find all assignments
        rc <- purrr::map(ast[-1], extractObjects)

        ## If there was only one assignment, simplify reporting
        if( length(rc) == 1 ) purrr::flatten(rc)
        else rc
    }
}

extractObjects( getAST(expr1) )
# $created
# [1] "d"
#
# $required
# [1] "+" "a" "b"

You may then filter math operators out, if needed.
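One simple way to do that filtering, as a sketch: compare the "required" names against an explicit set of operators. The vector math_ops below is an assumption chosen for illustration and is not exhaustive; extend it as needed.

```r
## Result of the "required" step above, typed in by hand for this sketch
required <- c("+", "a", "b")

## Assumed (non-exhaustive) set of arithmetic operators to drop
math_ops <- c("+", "-", "*", "/", "^", "%%", "%/%")

setdiff(required, math_ops)
# [1] "a" "b"
```

Note that setdiff also removes duplicates, so a variable that is used twice is reported only once.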

Thanks Artem! This works well indeed for the categorisation; however, for more complex expressions more conditions would be needed, I think: `expr2 <- rlang::expr({ if (a == b) { d <- 5 } else if (a == c) { d <- 2 } else { d <- 0 } })` Then in this case, the contents within the if() would be classified as "required" as well. — StephGC, Sep 09 '20 at 11:54

Valeri Voev · Answer 3 · 2020-09-03T18:54:07.897

This is an interesting one. I think that conceptually, it might not be clear in ALL possible expressions what exactly is input and output. If you look at the so called abstract syntax tree (AST), which you can visualize with lobstr::ast(), it looks like this.

So in simple cases when you always have LHS <- operations on RHS variables, if you iterate over the AST, you will always get the LHS right after the <- operator. If you assign z <- rlang::expr(d <- a+b), then z behaves like a list and you can for example do the following:

z <- rlang::expr(d <- a+b)

for (i in 1:length(z)) {
  if (is.symbol(z[[i]])) {
    print(paste("Element", i, "of z:", z[[i]], "is of type", typeof(z[[i]])))
    if (grepl("[[:alnum:]]", z[[i]])) {print(paste("Seems like", z[[i]], "is a variable"))}
  } else {
    for (j in 1:length(z[[i]])){
      print(paste("Element", j, paste0("of z[[",i,"]]:"), z[[i]][[j]], "is of type", typeof(z[[i]][[j]])))
      if (grepl("[[:alnum:]]", z[[i]][[j]])) {print(paste("Seems like", z[[i]][[j]], "is a variable"))}
    }
  }
}
#> [1] "Element 1 of z: <- is of type symbol"
#> [1] "Element 2 of z: d is of type symbol"
#> [1] "Seems like d is a variable"
#> [1] "Element 1 of z[[3]]: + is of type symbol"
#> [1] "Element 2 of z[[3]]: a is of type symbol"
#> [1] "Seems like a is a variable"
#> [1] "Element 3 of z[[3]]: b is of type symbol"
#> [1] "Seems like b is a variable"

Created on 2020-09-03 by the reprex package (v0.3.0)

As you can see, these trees can quickly get complicated and nested. So in a simple case like in your example, assuming that variables use alphanumeric names, we can more or less identify what the "objects" (as you call them) are and what the operators are (these don't match the [[:alnum:]] regex). As you can see, the type cannot be used to distinguish between objects and operators since it is always symbol (btw z above is of type language, as is z[[3]], which is why we can condition on whether z[[i]] is a symbol or not and, if not, dig a level deeper). You could then (at your peril) try to classify the objects that appear immediately after a <- as "outputs" and the rest as "inputs", but I don't have much confidence in this, especially for more complex expressions.

In short, this is all very speculative.
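To explore how far the "symbol right after <- is an output" idea carries into nested expressions, here is a rough sketch of a recursive walk. The helper collect_vars is hypothetical, uses base quote() instead of rlang::expr to stay dependency-free, and makes simplifying assumptions: only plain symbols on the left of <-, = or <<- count as created, every other symbol that is not in function position counts as required, and calls with empty arguments (such as x[1, ]) are not handled.

```r
# Hypothetical helper: walk a nested expression and collect assignment
# targets ("created") and all other variable names ("required").
collect_vars <- function(expr) {
  created <- character()
  required <- character()

  walk <- function(e) {
    if (is.call(e)) {
      fn <- e[[1L]]
      is_assign <- is.symbol(fn) &&
        as.character(fn) %in% c("<-", "=", "<<-") &&
        length(e) == 3L
      if (is_assign) {
        lhs <- e[[2L]]
        if (is.symbol(lhs)) {
          created <<- c(created, as.character(lhs))
        } else {
          walk(lhs)  # complex target such as d[x]: names counted as required
        }
        walk(e[[3L]])
      } else {
        # Skip the function name itself and visit only the arguments
        for (arg in as.list(e)[-1L]) walk(arg)
      }
    } else if (is.symbol(e)) {
      required <<- c(required, as.character(e))
    }
    # Constants such as 5 or "text" are ignored
  }

  walk(expr)
  list(created = unique(created), required = unique(required))
}

expr2 <- quote({
  if (a == b) { d <- 5 } else if (a == c) { d <- 2 } else { d <- 0 }
})
collect_vars(expr2)
#> $created
#> [1] "d"
#>
#> $required
#> [1] "a" "b" "c"
```

Even with this, the classification stays ambiguous in general: a variable that is assigned in one statement and read in a later one appears in both lists.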

Using regex won’t work here, since object names can be completely arbitrary in R. `foo + bar #!` is a valid R name, when surrounded by backticks. — Konrad Rudolph, Sep 03 '20 at 19:06
Hi @KonradRudolph - I agree, the example I provided is very "narrow", I was rather exploring the topic than providing a robust answer, but it wouldn't have fitted in a comment. — Valeri Voev, Sep 03 '20 at 19:22
Hi Valeri, thanks for your input. Yes indeed I implemented something like this but sure enough, it breaks with more complex expressions that contain e.g `if(a ==b)...` as then the contents within the if() would be classified as inputs as well. An interesting problem though. I saw within advanced R (https://adv-r.hadley.nz/expressions.html#finding-all-variables-created-by-assignment) some tricks for finding "outputs" in this case, maybe a solution for "inputs" could be derived from that as well. — StephGC, Sep 09 '20 at 12:01
I think that the other answers here are anyways more robust and helpful than my suggestion, hopefully the recursive solution by @KonradRudolph will work also in the more complex cases where you have calls nested inside calls. However, I'd be careful in a more general context what is an input and what it is an output. The output of one assignment can be the input in the next one. Not to mention that you have things like `a <- b <- 1` where things can get even more muddy and for example the `vars_in_assign` returns that `a` is created while `b` is required, which is questionable. — Valeri Voev, Sep 09 '20 at 13:43
