A common idiom (found in books, tutorials, and on many Stack Overflow questions) is to use df as a sort of throw-away identifier for a dataframe. I've done so hundreds of times with seemingly no ill-effect, but then ran into the following code:

library(tree)
df <- droplevels(iris[1:100,c(1,2,5)])
tr <- tree(Species ~ ., data = df)
plot(tr)
text(tr)
partition.tree(tr)

This gives the following error message:

Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ""function"" to a data.frame

I discovered by trial and error that if I simply replace df above by df2, the code works as expected. It is true that df is the name of the density function for the F-distribution, but that doesn't seem to be remotely relevant here. Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?

bjarchi
John Coleman
  • Most people do avoid object naming conflicts with common functions like this, yes, but actual problems are _very_ rare in my experience. – joran May 03 '18 at 20:05
  • I do not get that error, but after `partition.tree(tr)` I do get "Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type". I suspect you have a package loaded whose `droplevels` function is masking the behavior of `droplevels` in pkg:base. – IRTFM May 03 '18 at 20:08
  • In this case, it's because `eval()` is being run with the wrong environment deep inside `tree::model.frame.tree`. I'd consider it a bug in the `tree` package in this case. But in general, do avoid standard function names. – MrFlick May 03 '18 at 20:09
  • @MrFlick: So you could explain that error to the package authors. (I wonder if it is the cause of the error I get?) Please report back. – IRTFM May 03 '18 at 20:12
  • @42- Actually, I just realized I was getting the same error as you, different from the OP. Tested with `tree_1.0-37` and `R 3.4.1` – MrFlick May 03 '18 at 20:14
  • I was using R 3.4.3 and tree 1.0-39 – IRTFM May 03 '18 at 20:18
  • I repro the OP's error with R 3.3.3 tree 1.0-39. Either way, looks like a bug in how the package is looking up `df`. – Frank May 03 '18 at 20:21
  • Interesting. I get the OP's stated error with R 3.4.3 and tree 1.0-39 (mac 10.12.5). I agree with @MrFlick that I would consider this a bug in how tree is eval-ing some original call. – joran May 03 '18 at 20:22

2 Answers

Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?

I think in this case it may be both, but for your purposes I would take it more as a cautionary example. The fact that it causes an error here indicates that it may not be the best practice.

In my experience R does not manage namespaces very well (comparing it to Python, for example). Because of this, it may have been unwise for the authors of tree to introduce (intentionally or not) a conflict with df - which is a common throwaway name for a dataframe - if in fact they did so (see comments here and in the question; it is unclear whether this is a clash in data.frame names or improper use of eval() causing clashes between data.frame objects and functions).

With that said, it is a good example of why namespaces are important and (IMO) suggestive of how to write better R code. I think namespaces are being introduced to the R ecosystem, but my experience with R is that there is a lot of namespace 'flatness' and lots of opportunities for name conflicts. For this reason I would suggest that you take this as a reason to use more descriptive / unique identifiers for your own variables. This avoids conflicts like the one you encountered, and provides some future-proofing to help avoid conflicts creeping into previously working code if package internals change.
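
To make the eval() possibility raised in the comments concrete, here is a minimal sketch of how evaluating a captured argument in the wrong environment can turn a data frame named df into a function. Both helpers are hypothetical and are not taken from the tree package; evaluating inside the stats namespace is an assumption, chosen only because that is where the function df lives.

```r
# Hypothetical helpers for illustration only; NOT code from the tree package.

# Evaluates the caller's expression where the caller wrote it.
caller_lookup <- function(data) {
  data_expr <- substitute(data)              # the symbol the caller typed, e.g. df
  eval(data_expr, envir = parent.frame())
}

# Assumption: evaluates the same expression inside a package namespace,
# so a binding of the same name in that namespace wins.
namespace_lookup <- function(data) {
  data_expr <- substitute(data)
  eval(data_expr, envir = asNamespace("stats"))
}

df <- data.frame(x = 1:3)

is.data.frame(caller_lookup(df))    # TRUE: the user's data frame
is.function(namespace_lookup(df))   # TRUE: stats::df was found instead
```

Passing that function on to as.data.frame() then fails because a function cannot be coerced to a data frame. With a name that no function shares, such as df2 defined at top level, the second helper would fall through to the global environment and find the data frame, which is consistent with the observation in the question that renaming makes the problem go away.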

bjarchi
  • I don't think the authors of tree are "introducing a conflict with `df`". The comments on the question suggest that they are using `eval` in a bad way, which would probably cause an error for a user-named data frame sharing a name with any function. – Gregor Thomas May 03 '18 at 20:52
  • From MrFlick's comments it seems that the improper use of eval() is the cause of a different problem, not the one exposed by the OP, although I take your point. I'll edit to state that this could be a more subtle and specific bug in tree that exposes this problem, but I think my point about namespaces and using unique identifiers stands. – bjarchi May 03 '18 at 22:40
  • Yes, I agree with most of your answer. My point is simply that it is not the package authors who are using `df`, but OP. When you say *"may have been unwise for the authors of tree to introduce (intentionally or not) a conflict with `df`"* it makes it sound like the package authors used `df` improperly, when it seems like they are not using `df` at all. It also makes it sound like the problem is limited to the `df` term, but almost certainly if one data frame sharing a function's name is a problem, others would be as well, whether commonly used as data frame names like `data` or `dt`, or not. – Gregor Thomas May 04 '18 at 01:48
  • I accept your point, and I edited my original answer to call that out - does the edited text (which calls out these comments) address your concerns? I don't have a copy of R and tree on this machine to dig into whether or not there is an actual conflict with the name df. – bjarchi May 04 '18 at 02:07
  • A search of the package on GitHub shows [that the only time the package authors use `df` it is the name of a list item](https://github.com/cran/tree/search?utf8=%E2%9C%93&q=df&type=), which looks like it holds the degrees of freedom for a fitted model. So I think your general points about namespaces are good, but your specific comments about what the package authors did are unfounded. – Gregor Thomas May 04 '18 at 13:57

Because the potential name conflict would make errors more difficult to debug, I forced myself to use dtf instead of df for a long time. However, important packages in the tidyverse seem to be fine with using df throughout their tests, for example in test-select.r:

  df <- tibble(g = 1:3, x = 3:1) %>% group_by(g)

I've been using df a lot recently to name Python data frames, so I tend to use df in R as well nowadays. Let's see if this bites back.
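
One reason this habit rarely bites: when a name is used in call position, R looks for a function with that name and skips non-function objects, so a data frame called df does not hide stats::df. A small sketch of that behavior (the data and the argument values are arbitrary):

```r
# A user data frame that shares its name with the F density function stats::df
df <- data.frame(g = 1:3, x = 3:1)

# In call position R searches for a *function* named df and skips the data frame
density_value <- df(1, df1 = 2, df2 = 3)

# In value position the data frame is found first
is.data.frame(df)                                          # TRUE
identical(density_value, stats::df(1, df1 = 2, df2 = 3))   # TRUE
```

Trouble only starts when some code looks the name up as a plain value in an environment where the function is found before the data frame.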

Flat or nested namespace

The question of namespaces is not part of the original question, but it is related to this issue of a name conflict with df. A flat namespace is easier and more fun to use in exploratory data analysis, because you just call all functions directly, but it can lead to collisions. A nested namespace makes debugging more reliable at the cost of being a little more cumbersome, because you have to prefix each function call with the package name.

Namespace collisions are less of an issue in Python because it has a more nested namespace. For example, you import numpy as np and prefix all numpy function calls with np, such as np.array(). (It's possible to do from numpy import *, but it is frowned upon and linters typically complain about it.)

In R you have to distinguish throwaway code used in exploratory data analysis from more durable code that you are going to reuse. In the second case, if you use only one or a few functions from another package, it's better not to attach the package with library(package_name) but to call the functions you really need with package_name::function.
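
As a small illustration of the second style, the sketch below calls one function through :: without attaching its package. The tools package and the file name are arbitrary choices for the example; tools ships with R, so nothing needs to be installed.

```r
# Remember what is attached before the call
attached_before <- search()

# Call a single function through its namespace; no library(tools) needed
extension <- tools::file_ext("report.csv")
extension                                # "csv"

# `::` loads the namespace but does not attach it, so the search path is unchanged
identical(attached_before, search())     # TRUE
```

Because nothing new is attached, none of the package's other exported names can mask objects in your workspace or functions from other packages.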

Paul Rougieux
  • Nice answer. After being careful to avoid `df` for a while after asking this question I have also fallen back into the habit of using it. Arguably, it is a bad idea which has nevertheless become idiomatic. As far as name spaces go, it is a small thing but `::` is more annoying than `.` to type, which is perhaps one reason why I'm a little less likely to use `zoo::rollmean` in R than I am to use `math.log` in Python. – John Coleman May 27 '20 at 11:27