When creating a data frame with multiple variables using data.frame(), a variable cannot be defined as a function of other variables created in the same data.frame() call. This is demonstrated in the code sample below: Example 1 succeeds because the expressions for x and y do not require any object in our environment, whereas Example 2 returns an error because x is not in the global environment.

Why does this happen?

I can think of two possible explanations, but I do not know how to evaluate them (pun intended):

  • Scoping: each assignment expression is evaluated sequentially (i.e. x is assigned then y is assigned) but only looks for objects in the environment in which data.frame() was called. Since data.frame() was called in the global environment but x is not in the global environment, an error is returned in Example 2. This may also be why y = 6 rather than y = 1 in Example 3.

  • Evaluation: all assignment expressions are evaluated simultaneously (i.e. in parallel), causing x to not exist in any environment at the time y is assigned a value that is a function of x. While R employs lexical (i.e. static) scoping, perhaps data.frame() is designed to look for x in both the environment in which x was called and the child environments within the function.

# Example 1 (success)
data.frame(x = 0, y = 0 + 1)
#>   x y
#> 1 0 1

# Example 2 (failure)
data.frame(x = 0, y = x + 1)
#> Error in data.frame(x = 0, y = x + 1): object 'x' not found

# Example 3
x <- 5
data.frame(x = 0, y = x + 1)
#>   x y
#> 1 0 6
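
One way to probe the scoping explanation is to call data.frame() from inside a function that has its own local x. The sketch below uses a hypothetical helper, make_df(), which is not part of the original examples. If argument expressions are looked up in the environment where data.frame() is called, rather than specifically in the global environment, y should pick up the local value.

```r
# Hypothetical helper: calls data.frame() from a non-global environment.
make_df <- function() {
  x <- 10                        # local object, not a column
  data.frame(x = 0, y = x + 1)   # `x + 1` is resolved where the call is written
}

res <- make_df()
res$x   # the column x, which holds 0
res$y   # computed from the local x (10), not from the column x
```

Here y comes out as 11, which suggests that "the environment in which data.frame() was called" is the relevant scope, and that the global environment in Examples 2 and 3 is just a special case of it.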

Note: I am trying to understand why data.frame() exhibits this behavior. As observed in the comments and demonstrated below, tibble::tibble() is an excellent option for users who wish to generate variables in a data.frame conditional on other variables in the data.frame.

library(tibble)

# Tibble Example 1: y uses x!
tibble(x = 0, y = x + 1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     0     1

# Tibble Example 2: y uses x, ignoring the global x!
x <- 5
tibble(x = 0, y = x + 1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     0     1
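
For completeness, a similar result seems achievable in base R by creating the column first and deriving the second variable afterwards. This is only a sketch of two common patterns (the names df and df2 are illustrative), not a statement about how data.frame() is meant to be used.

```r
x <- 5   # a global x, as in Example 3

# Pattern 1: create the column, then derive y from it in a second step.
df <- data.frame(x = 0)
df$y <- df$x + 1

# Pattern 2: transform() evaluates its expressions with the columns in scope,
# so `x` here refers to the column rather than the global x.
df2 <- transform(data.frame(x = 0), y = x + 1)

df$y
df2$y
```

In both cases y is computed from the column x (0) rather than the global x (5).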
socialscientist
  • You could instead use `tibble`; then this works: `tibble(x = 0, y = x + 1)` – Jonathan Jun 09 '22 at 22:22
  • The reason this does not work with `data.frame` is R's lazy evaluation (`tibble` knows the "order" of the arguments and evaluates them in the order passed, whereas `data.frame` columns cannot even access the scope of the data frame itself), so it is a combination of both – Jonathan Jun 09 '22 at 22:25
  • Yes, I know this. I've added an explanation clarifying my intent with the post and showing tibble's functionality. – socialscientist Jun 09 '22 at 22:36
  • @Jonathan I'm not sure about the lazy evaluation point...is there a way to demonstrate that data.frame() is using lazy evaluation? I know tibble does, but it only "fixes" this issue because it evaluates sequentially https://www.rdocumentation.org/packages/tibble/versions/1.4.2/topics/tibble – socialscientist Jun 09 '22 at 22:41
  • Could you clarify what *specifically* you are hoping to get in an answer? It was already pointed out that one of the key differences between the `data.frame` and `tibble` constructors is the fact that `tibble` builds columns sequentially, whereas `data.frame` does not. What additional details/information are you looking for? I feel it is going to be difficult to satisfactorily address a question of the form "why is this so". – Maurits Evers Jun 09 '22 at 23:11
  • You may have already come across this post, but if not, this may be interesting & relevant: [What can a data frame do that a tibble cannot?](https://stackoverflow.com/questions/66466656/what-can-a-data-frame-do-that-a-tibble-cannot) – Maurits Evers Jun 09 '22 at 23:20
  • Not really an answer to "why" `data.frame` does things the way it does, but the [source code](https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/base/R/dataframe.R#L437) shows "what" the `data.frame` constructor does. From a quick/superficial look, `data.frame` simply iterates through the list of all (column) vectors, does some dimension checking & recycling (if necessary), and then stores them in a dressed `list`, i.e. your final `data.frame`. This is just a simple `for` loop, and processing element *i* has no reference to element *i-1*. – Maurits Evers Jun 10 '22 at 00:34
  • @MauritsEvers A satisfactory answer would show that `data.frame()` internally is doing something where arguments are evaluated "simultaneously" (presumably via vectorization since nothing is parallelized in base R) so this cannot happen and/or where it is evaluating objects in the scope of the call rather than within. I have zero interest in what `tibble()` does (not) do and only added that note for those who might find this question with the hope of finding something with the documented behavior of `tibble()`. – socialscientist Jun 10 '22 at 07:12
  • @user3614648 `data.frame()` does *not* evaluate "simultaneously". Quite the contrary. It processes *sequentially*, but there is no eval. It's just a simple `for` loop (please take a look at my previous comment and the link to the `data.frame` constructor source code!). It's not clear to me what more you're after. The source code shows you exactly what's going on. There is no "magic" there, no (lazy) eval, no quotation. – Maurits Evers Jun 10 '22 at 08:08
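
The points raised in the comments can be checked with a small experiment. The sketch below records the order in which the argument expressions run, and then uses a hypothetical stand-in, collect(), for a constructor that gathers its arguments through `...` (an assumption based on the linked source, not a copy of it). The names eval_order, first_col and second_col are illustrative, and the second half assumes no object called first_col exists in the calling environment.

```r
# Part 1: record the order in which the argument expressions are evaluated.
eval_order <- character()
df <- data.frame(
  a = { eval_order <- c(eval_order, "a"); 0 },
  b = { eval_order <- c(eval_order, "b"); 1 }
)
eval_order   # shows which expression ran first

# Part 2: a plain function that only collects its arguments.
# Hypothetical stand-in, not the real data.frame() implementation.
collect <- function(...) list(...)

# `second_col` refers to `first_col`, which is only an argument name here,
# not an object in the calling environment.
result <- tryCatch(
  collect(first_col = 0, second_col = first_col + 1),
  error = function(e) "error"
)
result
```

If eval_order comes back as "a" then "b" and result is "error", that is consistent with arguments being evaluated one after another in the caller's environment, where an argument name such as first_col never becomes an object that later arguments can see. Nothing about this is specific to data.frame(); any function that receives its columns through `...` would behave the same way.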
