
When I teach people how to use dplyr, I warn them not to assume that any dplyr function will preserve the row order of their data frames/tibbles unless the documentation says so. However, I have not been able to find any official documentation on the matter, which makes it harder to convince people to be more careful about assuming what their code is doing. For example, the documentation for mutate() explicitly guarantees that the number of rows will be preserved, but says nothing about order preservation. Is there any official statement or documentation associated with dplyr (or the tidyverse) that I can point people to, describing what assumptions, if any, can be made about row order preservation in its functions?

anjama
  • As far as I know, `mutate` preserves the order but `summarise` does not; it sorts in increasing order. – Rui Barradas Feb 11 '20 at 16:27
  • I'd post this on the RStudio Tidyverse forum as well: https://community.rstudio.com/c/tidyverse/6 – cardinal40 Feb 11 '20 at 16:27
  • As far as I can see, the Tidyverse is generally meant to preserve order. Look at a recent [bug fix where one of the functions didn't](https://github.com/tidyverse/dplyr/blob/92a3fbb275b897ee0250a1d58b9623b5b1dff833/NEWS.md#minor-improvements-and-bug-fixes), [this test](https://github.com/tidyverse/dplyr/blob/e0daacc4243e87eab1864891d24762c194b1e798/tests/testthat/test-filter.r#L306), or the several functions whose documentation explicitly states that they change the order. But I'm not sure this implicit rule is reflected in any explicit principle. – JBGruber Feb 11 '20 at 16:37
  • Do you have examples of the functions you're thinking of *not* preserving order? – camille Feb 11 '20 at 16:48
  • @JBGruber It's interesting that the [issue](https://github.com/tidyverse/dplyr/issues/3989) associated with the test you link demonstrates exactly why this is important. Someone made an assumption about order being preserved, whereas the developers did not, and then a change in the package broke fragile code based on said assumption. This was just 14 months ago. Coincidentally, the very last comment in the issue before it was closed raises the exact same concern I have here (with no response). – anjama Feb 12 '20 at 02:13
  • @camille The issue isn't whether it behaves as someone would expect now, but whether it can be expected to behave the same way in the future. Without some sort of explicit documentation (which is what I'm trying to find, if it exists), you lose the guarantee that your code will behave the same in the future. See the link in my previous comment above for a situation where this literally became an issue not long ago for a user of dplyr. – anjama Feb 12 '20 at 02:19
  • I think your question is pretty clear. You want to know if it is an agreed-upon and documented principle in the tidyverse that row order should be preserved. In my opinion, it might be worth opening an issue, either in dplyr or in the Tidyverse package itself. – JBGruber Feb 12 '20 at 08:47

1 Answer


This is from the Roxygen comments in the mutate source code:

For mutate():

  • Rows are not affected.

  • Existing columns will be preserved unless explicitly modified.

  • New columns will be added to the right of existing columns.

  • Columns given value NULL will be removed.

  • Groups will be recomputed if a grouping variable is mutated.

  • Data frame attributes are preserved.

For transmute():

  • Rows are not affected.

  • Apart from grouping variables, existing columns will be removed unless explicitly kept.

  • Column order matches order of expressions.

  • Groups will be recomputed if a grouping variable is mutated.

  • Data frame attributes are preserved.

I would interpret "Rows are not affected" as saying that row order is preserved. Since it comes from the source code, I would take it as canonical.
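The distinction between per-function behaviours can be checked empirically. The following is a minimal sketch, assuming a made-up tibble `df` with hypothetical columns `id`, `group`, and `value`. It only shows what an installed version of dplyr does today; it is not a documented guarantee.

```r
library(dplyr)

# Hypothetical example data; `id` records the original row position.
df <- tibble(
  id    = 1:6,
  group = c("b", "a", "c", "a", "b", "c"),
  value = c(10, 20, 30, 40, 50, 60)
)

# mutate(): each output row should line up with the corresponding input row.
mutated <- df %>% mutate(doubled = value * 2)
mutate_keeps_order <- identical(mutated$id, df$id)

# group_by() + summarise(): one row per group, ordered by the group key
# rather than by first appearance in the input.
summarised <- df %>%
  group_by(group) %>%
  summarise(total = sum(value))
summarise_groups <- summarised$group

mutate_keeps_order  # TRUE
summarise_groups    # "a" "b" "c", although the input starts with "b"
```

A check like this can be kept as a small regression test, so that a future change in behaviour shows up as a test failure rather than as silently misaligned results.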

Allan Cameron
  • The problem is that if someone is overly pedantic (like me, I fully admit), those statements are in large part vague in terms of their original intention and interpretation. They also don't tell me about the rest of the package. More importantly, I guarantee you that the scientists I'm helping to introduce to R, dplyr, etc., are generally not the type who are going to read through comments in package source code (most are the type that want nothing to do with programming at all); getting them to read any documentation in general is hard enough as it is. – anjama Feb 12 '20 at 02:33
  • @anjama I agree the wording could be better, but what else could "Rows are not affected" mean? If you're wondering about the rest of the package, we know that some functions like summarise *do* change the order, so you need to consider each function on its own merit. It doesn't really make sense to ask if a *package* maintains ordering. Also, you don't need to direct the scientists to the documentation. You are teaching them, so you only need to satisfy yourself by reading the source code. – Allan Cameron Feb 12 '20 at 05:56
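For code that should not depend on undocumented ordering at all, one defensive pattern is to record the original row position explicitly and restore it at the end of the pipeline. This is a minimal sketch, assuming a made-up tibble `readings` and a hypothetical helper column `row_id`. Here `arrange(reading)` stands in for any step whose effect on row order is uncertain.

```r
library(dplyr)

# Hypothetical example data.
readings <- tibble(
  sample  = c("s3", "s1", "s2"),
  reading = c(0.7, 0.2, 0.5)
)

processed <- readings %>%
  mutate(row_id = row_number()) %>%  # record the original position
  arrange(reading) %>%               # stand-in for a step that may reorder rows
  mutate(rank = row_number()) %>%    # rank by reading, smallest first
  arrange(row_id)                    # restore the original order explicitly

# Fail loudly if the rows no longer line up with the input.
stopifnot(identical(processed$sample, readings$sample))
```

Making the order explicit with `arrange()` means the result no longer relies on any verb's implicit behaviour, at the cost of one extra column.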