
Fans of the Tidyverse regularly give several advantages of using tibbles rather than data frames. Most of them seem designed to protect the user from making mistakes. For example, unlike data frames, tibbles:

  • Don't need a drop = FALSE argument to keep subsetting from dropping dimensions.
  • Will not let the $ operator do partial matching for column names.
  • Only recycle your input vectors if they are of exactly length one.
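
For concreteness, here is a minimal sketch of those three differences. The toy columns `value` and `label` are made up for illustration, and the tibble calls that warn or fail are wrapped so the script runs to the end.

```r
library(tibble)

df <- data.frame(value = 1:3, label = c("a", "b", "c"))
tb <- as_tibble(df)

# 1. Single-column subsetting with `[`
df_col  <- df[, "value"]                 # data frame drops to an integer vector
df_kept <- df[, "value", drop = FALSE]   # drop = FALSE keeps a one-column data frame
tb_col  <- tb[, "value"]                 # tibble stays a one-column tibble

# 2. Partial matching with `$`
df_partial <- df$val                     # silently matches the column `value`
tb_partial <- suppressWarnings(tb$val)   # tibble warns about an unknown column, returns NULL

# 3. Recycling
df_recycled <- data.frame(x = 1:4, y = 1:2)  # y is recycled to 1, 2, 1, 2
tb_recycled <- tryCatch(
  tibble(x = 1:4, y = 1:2),                  # tibble refuses to recycle length 2
  error = function(e) e
)
```

Each point is a case where the data frame quietly does something and the tibble either keeps its shape or complains.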

I'm steadily becoming convinced to replace all of my data frames with tibbles. What are the primary disadvantages of doing so? More specifically, what can a data frame do that a tibble cannot?

Preemptively, I would like to make it clear that I am not asking about data.table or any big-picture objections to the Tidyverse. I am strictly asking about tibbles and data frames.

J. Mini
  • Tibbles are data frames - _i.e._ they have class `data.frame`, just with additional methods. So it's not so much what's different about a data frame, as how tibble modifies data frame behaviour. The differences are captured in the [tibbles vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). Personally I think the modified `print` method is the most useful feature. – neilfws Mar 04 '21 at 00:06
  • Not an answer, but a long, interesting thread on R-package-devel in which several R authorities discuss some of the implications of tibbles: [tibbles are not data frames](https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896.html). In any case, Jim Lemon's metaphor about including tigers in mixed martial arts competitions makes the thread worth reading. – Henrik Mar 04 '21 at 00:40
  • @Henrik Certainly a fun read, but the summary seems to be as simple as "tibbles violate the Liskov substitution principle". – J. Mini Mar 05 '21 at 20:21
  • Although "Matrix indexing [of a `data.frame`] `x[i]` with a logical or a 2-column integer matrix `i` using `[` is not recommended" (`?[.data.frame`) it can be handy (e.g. [here](https://stackoverflow.com/questions/18056799/index-a-data-frame-row-by-row-using-column-names-selected-from-a-variable), [here](https://stackoverflow.com/questions/25584039/using-row-wise-column-indices-in-a-vector-to-extract-values-from-data-frame)). It seems like such indexing can't be used on a tibble. `tb = tibble(x = 1:3, y = 4:6)`; `m = cbind(c(3, 2), c(2, 1))`; `tb[m]`; `df = as.data.frame(tb)`; `df[m]` – Henrik Mar 08 '21 at 08:11
  • But again, this is supposedly just another example of protecting the tibble user from self-inflicted harm (coercion) – Henrik Mar 08 '21 at 08:43
  • @Henrik Interesting, I can't see that anywhere in the documentation for tibbles. The one place that I found only mentions it for assignment, not subsetting. – J. Mini Mar 08 '21 at 13:02
  • Haven't checked the docs, but found a [News item](https://github.com/tidyverse/tibble/blob/master/NEWS.md#tibble-11): "Strict checking of integer and logical column indexes [...] Passing a matrix or an array now raises an error in any case" – Henrik Mar 08 '21 at 13:15
  • One thing that I found tibbles lack is row names. Though this is by design, it is sometimes a bit annoying when calculating distances. Using `tibble::column_to_rownames()` does work, but changes the type back to `data.frame` – Max Teflon Mar 12 '21 at 15:05
  • My advice : go directly to data.table ! – MrSmithGoesToWashington Mar 13 '21 at 13:14
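
The matrix-indexing point from the comments can be sketched as follows, reusing the `tb`, `m` and `df` objects from that comment. The tibble call is wrapped in `tryCatch()` because, going by the NEWS item quoted above, it is expected to raise an error; the exact message depends on the installed tibble version, so none is shown.

```r
library(tibble)

tb <- tibble(x = 1:3, y = 4:6)
df <- as.data.frame(tb)

# Each row of m is a (row, column) pair: (3, 2) and (2, 1)
m <- cbind(c(3, 2), c(2, 1))

# Data frame: picks df[3, 2] and df[2, 1] in a single call
df_cells <- df[m]

# Tibble: expected to fail, so keep the message instead of stopping the script
tb_cells <- tryCatch(tb[m], error = function(e) conditionMessage(e))
```

Here `df_cells` holds the values 6 and 2. With columns of mixed types the data frame version goes through `as.matrix()` and coerces everything to a common type, which is the coercion hazard mentioned above.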

2 Answers


From "The trouble with tibbles", you can read:

there isn’t really any trouble with tibbles

However,

Some older packages don’t work with tibbles because of their alternative subsetting method. They expect tib[,1] to return a vector, when in fact it will now return another tibble.

This is what @Henrik pointed out in comments.

As an example, the length function won't return the same result:

```r
library(tibble)
tibblecars <- as_tibble(mtcars)
tibblecars[,"cyl"]
#> # A tibble: 32 x 1
#>      cyl
#>    <dbl>
#>  1     6
#>  2     6
#>  3     4
#>  4     6
#>  5     8
#>  6     6
#>  7     8
#>  8     4
#>  9     4
#> 10     6
#> # ... with 22 more rows
length(tibblecars[,"cyl"])
#> [1] 1
mtcars[,"cyl"]
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
length(mtcars[,"cyl"])
#> [1] 32
```

For further examples, "Invariants for subsetting and subassignment" explains where the behaviour of tibbles differs from that of data frames.

Given these known limitations, the solution Hadley gives in "interacting with legacy code" is:

A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:
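
A minimal sketch of that workaround, assuming a hypothetical older function, `legacy_mean_first_col()`, which is made up here to stand in for package code that expects `df[, 1]` to be a vector:

```r
library(tibble)

tibblecars <- as_tibble(mtcars)

# Hypothetical stand-in for older code that assumes df[, 1] is a vector
legacy_mean_first_col <- function(df) {
  mean(df[, 1])
}

# With a tibble, df[, 1] is still a tibble, so mean() warns and returns NA
tibble_result <- suppressWarnings(legacy_mean_first_col(tibblecars))

# Converting back to a plain data frame restores the classic behaviour
df_result <- legacy_mean_first_col(as.data.frame(tibblecars))
```

When you control the calling code, `tibblecars[[1]]` or `tibblecars$mpg` returns a plain vector from either type, so no conversion is needed.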

Waldi
  • and an example in the wild [here](https://stackoverflow.com/questions/54922515/bnlearn-r-error-variable-variable1-must-have-at-least-two-levels/54942022#54942022) – user20650 Mar 07 '21 at 13:12
  • The fact that `length` is an example of what you're pointing out seems non-obvious. The way that you've worded it makes it sound like the `length` function is at fault, making it easy to be fooled into thinking that this is some S3 trickery. What's really going on, and what you were trying to show, is that `length` returns `1` because it's giving the number of columns in the tibble `tibblecars[,"cyl"]`, which is one, whereas `length(mtcars[,"cyl"])` is 32 because `mtcars[,"cyl"]` is a vector - not a data frame - of 32 elements. In short, `length` is behaving correctly... on the wrong data. – J. Mini Mar 07 '21 at 15:28
  • I didn't say length is faulty; it just returns a different result depending on the object type (data frame or tibble), which might cause packages that rely on it to fail. – Waldi Mar 07 '21 at 15:30
  • I might take some offense at labelling code that follows classic data frame semantics as "legacy" code ... – Ben Bolker Jun 10 '22 at 02:16
  • @Ben Bolker, didn't mean any offense, just quoted Hadley ;-) – Waldi Jun 10 '22 at 05:21

Learned here: https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html

There are three key differences between tibbles and data frames:

  • printing
  • subsetting
  • recycling rules

Tibbles:

  • Never change an input’s type (i.e., no more stringsAsFactors = FALSE!)
  • Never adjust the names of variables
  • Evaluate arguments lazily and sequentially
  • Never use row.names()
  • Only recycle vectors of length 1
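
A short sketch of several of these points. The column names (`my var`, `s`, `model`) are made up for illustration, and `stringsAsFactors = TRUE` is passed explicitly because the `data.frame()` default changed in R 4.0.

```r
library(tibble)

# Names: data.frame() repairs non-syntactic names, tibble() keeps them
df_names <- names(data.frame(`my var` = 1:3))
tb_names <- names(tibble(`my var` = 1:3))

# Sequential evaluation: later columns can refer to earlier ones
tb_seq <- tibble(x = 1:3, y = x * 2)

# Types: data.frame() can convert strings to factors, tibble() never does
df_fct <- data.frame(s = c("a", "b"), stringsAsFactors = TRUE)
tb_chr <- tibble(s = c("a", "b"))

# Row names: as_tibble() drops them unless they are moved into a column
tb_cars <- as_tibble(mtcars, rownames = "model")
```

Here `df_names` is `"my.var"` while `tb_names` is `"my var"`, and the car names from `mtcars` end up in the `model` column of `tb_cars`.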

A large data frame prints as many rows as fit within getOption("max.print") (99999 entries by default); R then stops and reports how many rows were omitted.

A tibble prints only the first ten rows and the columns that fit on screen. Each column's data type and the size of the data set are also displayed.
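
To compare the two print methods yourself, something like the sketch below should do. The output is captured into character vectors rather than pasted here, since the exact layout varies between tibble versions.

```r
library(tibble)

tibblecars <- as_tibble(mtcars)

# Default tibble printing: dimensions header, column types, first ten rows
default_out <- capture.output(print(tibblecars))

# n and width override the defaults: three rows, all columns
wide_out <- capture.output(print(tibblecars, n = 3, width = Inf))

# A plain data frame prints every row, up to the max.print limit
df_out <- capture.output(print(mtcars))
```

`df_out` has one header line plus all 32 rows, while the tibble output stays short however many rows the data has.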

TarJae
  • This is a good start, but I think that what a data frame can do that a tibble cannot isn't so much about listing the differences between the two data types, it's about listing the consequences of those differences. – J. Mini Mar 06 '21 at 15:22