I am encountering an issue when I use the extraction operator `$`() inside of a function. The problem does not exist if I follow the same logic outside of the function, so I assume there might be a scoping issue that I'm unaware of.

The general setup:

## Make some fake data for your reproducible needs.
set.seed(2345)

my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
                    cat_2 = sample(c("c", "d"), 100, replace = TRUE),
                    continuous  = rnorm(100),
                    stringsAsFactors = FALSE)
head(my_df)

This is the process I am trying to reproduce dynamically:

index <- which(`$`(my_df, "cat_1") == "a")

my_df$continuous[index]

But once I program this logic into a function, it fails:

## Function should take a string for the following:
##  cat_var - string with the categorical variable name as it appears in df
##  level - a level of cat_var appearing in df
##  df - data frame to operate on.  Function assumes it has a column 
##    "continuous".
extract_sample <- function(cat_var, level, df = my_df) {

  index <- which(`$`(df, cat_var) == level)

  df$continuous[index]

}

## Does not work.
extract_sample(cat_var = "cat_1", level = "a")

This is returning numeric(0). Any thoughts on what I'm missing? Alternative approaches are welcome as well.

apax
  • Fwiw, I would do `extract_sample = function(var, val, df = my_df) merge(df, setNames(data.frame(val), var), all.y=TRUE)` or similar. – Frank Apr 26 '18 at 16:39

3 Answers

The problem isn't the function, it's the way $ handles its second argument: it takes the name as written and does not evaluate it.

cat_var = "cat_1"
length(`$`(my_df,"cat_1"))
#> [1] 100
length(`$`(my_df,cat_var))
#> [1] 0 

You can instead use [[ to achieve your desired outcome.

cat_var = "cat_1"
length(`[[`(my_df,"cat_1"))
#> [1] 100
length(`[[`(my_df,cat_var))
#> [1] 100

UPDATE

It's been noted that using [[ this way is ugly. And it is. The function-call form is mainly useful when you want to write something like lapply(stuff, '[[', 1).

Here, you should probably be writing it as my_df[[cat_var]].
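
Applied to the function from the question, that might look like the sketch below. It assumes, as the original does, that df has a column named continuous. The default df = my_df is left out and the small data frame toy_df is made up purely for illustration.

```r
# Same logic as the question's function, with `$`(df, cat_var)
# replaced by df[[cat_var]].
extract_sample <- function(cat_var, level, df) {
  # [[ evaluates cat_var, so the string it holds is used as the column name
  index <- which(df[[cat_var]] == level)
  df$continuous[index]
}

# Hypothetical toy data, only to show the call
toy_df <- data.frame(cat_1 = c("a", "b", "a"),
                     continuous = c(1.5, 2.5, 3.5),
                     stringsAsFactors = FALSE)

extract_sample(cat_var = "cat_1", level = "a", df = toy_df)
#> [1] 1.5 3.5
```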

Also, this question/answer goes into a little more detail about why $ doesn't work the way you want it to.

Mark
  • What's ugly? The way R handles it or my example? – Mark Apr 26 '18 at 16:26
  • @HongOoi I share your aesthetic judgement, but perhaps it could be expressed differently? ;) I think the point here is that we should probably be using `[[` with its usual semantics: `my_df[[cat_var]]`. – joran Apr 26 '18 at 16:27
  • The main reason to be doing it this way is to support the `apply` family – Mark Apr 26 '18 at 16:28
  • I think maybe it was a little confusing because the OP didn't really need to use `[[` in that sort of anonymous function passed to `lapply` context, that's all. – joran Apr 26 '18 at 16:34
  • Thanks guys, the ``[[`` operator is exactly what I was forgetting. This simplifies things quite a bit. – apax Apr 26 '18 at 16:35
  • Fair, I was just trying to imitate the style they were using. Updated the answer to show the more visually pleasing form. – Mark Apr 26 '18 at 16:35

The problem is that the $ is non-standard, in the sense that when you don't quote the parameter input, it still tries to parse it and use what you typed, even if that was meant to refer to another variable.

Or more simply, as @42 put it in the first comment in the linked question:

The "$" function does not evaluate its arguments, whereas "[[" does.

Here's a much simpler data set as an example.

my_df <- data.frame(a=c(1,2))
v <- "a"

Compare the usual usage. The first two give the same result: whether or not you quote the name, $ uses what you typed literally. So the third one looks for a column actually named v, finds none, and returns NULL.

my_df$"a"
## [1] 1 2

my_df$a
## [1] 1 2

my_df$v
## NULL

That's exactly what's happening to you:

`$`(my_df, "a")
## [1] 1 2

`$`(my_df, v)
## NULL
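
One way to see that $ uses the name as typed is to give the data frame a column that really is called v. In this sketch (demo_df is a made-up data frame, not part of the question), $ returns that column rather than the column the variable v names:

```r
# Hypothetical data frame with a column literally named "v"
demo_df <- data.frame(a = c(1, 2), v = c(10, 20))
v <- "a"

demo_df$v      # looks up the column literally named "v"
## [1] 10 20

demo_df[[v]]   # evaluates v first, so looks up column "a"
## [1] 1 2
```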

Instead we need to evaluate v before sending to $ by using do.call.

do.call(`$`, list(my_df, v))
## [1] 1 2

Or, more appropriately, use the [[ version which does evaluate the parameters first.

`[[`(my_df, v)
## [1] 1 2
Aaron left Stack Overflow
  • Besides that link, there's also https://stackoverflow.com/a/18228613/ which quotes the source code for the "not evaluated" point. – Frank Apr 26 '18 at 16:47

The problem lies in the way you are indexing the column. This works with just a slight tweak to yours:

extract_sample <- function(cat_var, level, df = my_df) {
  index <- df[, cat_var] == level
  df$continuous[index]
}

Using it dynamically:

> extract_sample(cat_var = "cat_2", level = "d")
 [1] -0.42769207 -0.75650031  0.64077840 -1.02986889  1.34800344  0.70258431  1.25193247
 [8] -0.62892048  0.48822673  0.10432070  1.11986063 -0.88222370  0.39158408  1.39553002
[15] -0.51464283 -1.05265106  0.58391650  0.10555913  0.16277385 -0.55387829 -1.07822831
[22] -1.23894422 -2.32291394  0.11118881  0.34410388  0.07097271  1.00036812 -2.01981056
[29]  0.63417799 -0.53008375  1.16633422 -0.57130500  0.61614135  1.06768285  0.74182293
[36]  0.56538633  0.16784205 -0.14757303 -0.70928924 -1.91557732  0.61471302 -2.80741967
[43]  0.40552376 -1.88020372 -0.38821089 -0.42043745  1.87370600 -0.46198139  0.10788358
[50] -1.83945868 -0.11052531 -0.38743950  0.68110902 -1.48026285
rg255
  • Note that this will work with regular data frames, but if you use tidyverse functions that return tibbles, `df[, x]` will return a 1-column tibble. Best to use `df[[x]]` instead, which works with everything. – Hong Ooi Apr 26 '18 at 16:45
  • Or just not use tidyverse ;) (thanks and it's a good point to bear in mind) – rg255 Apr 26 '18 at 16:47
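
To illustrate the point about `df[, x]` versus `df[[x]]` without depending on tibble, base R's `drop = FALSE` gives roughly the shape a tibble returns from `df[, x]`: a one-column data frame rather than a vector. This is only a stand-in for tibble behaviour, and small_df is made up for the example.

```r
# Hypothetical toy data
small_df <- data.frame(cat_2 = c("c", "d", "d"),
                       continuous = c(0.1, 0.2, 0.3),
                       stringsAsFactors = FALSE)
x <- "cat_2"

# Plain data frame: a single-column [ , x] drops to a vector
is.data.frame(small_df[, x])
## [1] FALSE

# With drop = FALSE (similar in shape to a tibble's result) it stays a data frame
is.data.frame(small_df[, x, drop = FALSE])
## [1] TRUE

# [[ returns the underlying vector
is.character(small_df[[x]])
## [1] TRUE
```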