I got a weird result today.

To replicate it, consider the following data frames:

x <- data.frame(x=1:3, y=11:13)
y <- x[1:3, 1:2] 

They are supposed to be and actually are identical:

identical(x,y)
# [1] TRUE

Applying t() to identical objects should produce the same result, but:

identical(t(x),t(y))
# [1] FALSE

The difference is in the column names:

colnames(t(x))
# NULL
colnames(t(y))
# [1] "1" "2" "3"

Given this, if you want to stack y by columns, you get what you'd expect:

stack(as.data.frame(t(y)))
#   values ind
# 1      1   1
# 2     11   1
# 3      2   2
# 4     12   2
# 5      3   3
# 6     13   3

while:

stack(as.data.frame(t(x)))
#     values ind
# 1      1  V1
# 2     11  V1
# 3      2  V2
# 4     12  V2
# 5      3  V3
# 6     13  V3

In the latter case, as.data.frame() does not find the original column names and automatically generates them.

The culprit is in as.matrix(), called by t():

rownames(as.matrix(x))
# NULL
rownames(as.matrix(y))
# [1] "1" "2" "3"

A workaround is to set rownames.force:

rownames(as.matrix(x, rownames.force=TRUE))
# [1] "1" "2" "3"
rownames(as.matrix(y, rownames.force=TRUE))
# [1] "1" "2" "3"
identical(t(as.matrix(x, rownames.force=TRUE)), 
          t(as.matrix(y, rownames.force=TRUE)))
# [1] TRUE

(and rewrite the stack(...) call accordingly.)
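Another possible workaround, sketched here on the assumption that the difference lies only in how the row names are stored: assigning NULL to the row names of a data frame resets them to automatic ones, after which t() no longer sees a difference. The name y_reset is only illustrative.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Assigning NULL resets a data frame to automatic row names
y_reset <- y
row.names(y_reset) <- NULL

identical(t(x), t(y_reset))
# [1] TRUE

# ind is generated as V1, V2, V3, exactly as for x
stack(as.data.frame(t(y_reset)))
```

This goes in the opposite direction from rownames.force=TRUE: instead of forcing row names onto both matrices, it drops them from both.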

My questions are:

  1. Why does as.matrix() treat x and y differently, and

  2. how can you tell the difference between them?

Note that other info functions do not reveal differences between x, y:

identical(attributes(x), attributes(y))
# [1] TRUE
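identical(str(x), str(y))
# ...
# [1] TRUE
# (str() returns NULL invisibly, so this is TRUE for any two objects;
#  the point is that the printed str() output is the same for x and y)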

Comments to solutions

Konrad Rudolph gives a concise but effective explanation to the behaviour outlined above (see also mt1022 for more details).

In short, Konrad shows that:

a) x and y are internally different;
b) "identical is simply too lax by default" to catch this internal difference.

Now, if you take a subset T of a set S, and T has all the elements of S, then S and T are exactly the same object. So, if you take a data frame y which has all the rows and columns of x, then x and y should be exactly the same object. Unfortunately x ≠ y!
This behaviour is not only counterintuitive but also obfuscated: the difference is not self-evident, it is purely internal, and even the default identical function can't see it.

Another natural principle is that transposing two identical (matrix-like) objects produces identical objects. Again, this is broken by the fact that, before transposing, identical is "too lax"; after transposing, the default identical is enough to see the difference.

IMHO this behaviour (even if it is not a bug) is a misbehaviour for a scientific language like R.
Hopefully this post will draw some attention and the R team will consider revising it.

antonio
  • seems to be how the `row.names` are defined, as they are different in `dput(x)` and `dput(y)`. Maybe they are explicitly added when using `[.data.frame` – user20650 Apr 04 '17 at 14:35
  • You can use dput(x) and dput(y) and you will see that the row.names are stored in a different way. I think it's related to automatic row.names handling (check https://stat.ethz.ch/R-manual/R-devel/library/base/html/row.names.html details section for further info), no idea why subsetting returns different row.names though... and to be honest, it smells like an unexpected behavior to me – digEmAll Apr 04 '17 at 14:42
  • `identical(x, y, attrib.as.set=FALSE)` seems to pick up on differences (noting the line in `?identical`: *"Note that identical(x, y, FALSE, FALSE, FALSE, FALSE) pickily tests for exact equality."*) – user20650 Apr 04 '17 at 14:59
  • The difference arises in as.matrix when it calls `.row_names_info` and as @digEmAll pointed out is because of the automatic row names in `x`. – oropendola Apr 04 '17 at 15:17
  • As it's written, `as.matrix` removes automatic row names so that they don't end up as row names in the matrix. – oropendola Apr 04 '17 at 15:19
  • "row.names" are either stored as a `length == nrow(x)` vector or as a compact form of type `c(NA, -nrow(x))` to avoid creating and carrying a `as.character(1:nrow(x))` vector around. When subsetting "x", `"[.data.frame"` has to create some form of `row.names` for the subsetted "x". Even if `x[c(1, 2, 3), ]` seems to not need "row.names", something like `x[c(2, 3, 1), ]` needs them, and `"[.data.frame"` needs to be consistent regarding its output... – alexis_laz Apr 04 '17 at 15:46
  • ...Nonetheless, R can see that in the first case, "row.names" are in increasing order, hence they are stored as `c(NA, nrow(x))` (the object _has_ "row.names" but there's no need to create `1:nrow(x)`). In the second case a "row.names" attribute as `c(2, 3, 1)` has to be created. – alexis_laz Apr 04 '17 at 15:47

2 Answers

identical is simply too lax by default but you can change that:

> identical(x, y, attrib.as.set = FALSE)
[1] FALSE

The reason can be found by inspecting the objects in more detail:

> dput(x)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
-3L), class = "data.frame")
> dput(y)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
3L), class = "data.frame")

Note the distinct row.names attributes:

> .row_names_info(x)
[1] -3
> .row_names_info(y)
[1] 3

From the documentation we can glean that a negative number implies automatic rownames (for x), whereas y’s row names aren’t automatic. And as.matrix treats them differently.
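As a rough answer to the second question, the sign test can be wrapped in a small helper. This is only a sketch: has_automatic_row_names is a made-up name, not a base R function, and it relies on `.row_names_info()` returning a negative count for automatic row names, as described above.

```r
# Hypothetical helper (not part of base R): TRUE when the data frame
# carries "automatic" row names, i.e. .row_names_info() is negative.
has_automatic_row_names <- function(df) {
  stopifnot(is.data.frame(df))
  .row_names_info(df) < 0L
}

x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

has_automatic_row_names(x)   # x was built by data.frame() without row names
has_automatic_row_names(y)   # y went through `[.data.frame`

# type = 0L returns the row names as stored internally (compact form included)
.row_names_info(x, 0L)
.row_names_info(y, 0L)
```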

Konrad Rudolph
  • No disagreement. The help page for `row.names` says "Row names of the form 1:n for n > 2 are stored internally in a compact form,.." and that `as.matrix` and "other functions" will "handle [names of this sort] differently." Running trace('row.names') shows that it was called 3 times for the questioner's example (at least once there was a call to `print(y)`). It also says: "`row.names` will always return a character vector. (Use `attr(x, "row.names")` if you need to retrieve an integer-valued set of row names.)". – IRTFM Apr 04 '17 at 16:26
  • `row.names = c(NA,3L)` still produces row.names automatically as well as `row.names = c(NA,-3L)`. The question is, why subsetting the data.frame changes the sign (and consequently causes the difference) ? – digEmAll Apr 04 '17 at 21:56
  • @digEmAll : `c(NA, -3L)` seems to mark the object as not having explicit "row.names" (i.e. not set or set to `NULL`), which means that a function accounting for a data.frame's "row.names" should ignore this attribute. `c(NA, 3L)` seems to mark the object as having explicit "row.names" but of the form `1:nrow(x)`, which can be spared from creating. `"[.data.frame"` returns a subset of the data as well as a subset of its "row.names" (e.g. `x[2:3, ]`'s "row.names" cannot be stored compactly), and it seems that the most consistent way to behave is to always return an object with explicit "row.names". – alexis_laz Apr 05 '17 at 00:05

As noted in the comments, x and y are not strictly the same. When we call t on a data.frame, t.data.frame is executed:

function (x) 
{
    x <- as.matrix(x)
    NextMethod("t")
}

As we can see, it calls as.matrix, i.e. as.matrix.data.frame:

function (x, rownames.force = NA, ...) 
{
    dm <- dim(x)
    rn <- if (rownames.force %in% FALSE) 
        NULL
    else if (rownames.force %in% TRUE) 
        row.names(x)
    else if (.row_names_info(x) <= 0L) 
        NULL
    else row.names(x)
...

As commented by @oropendola, .row_names_info returns different values for x and y, and the function above is where the difference takes effect.
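To make that branch easier to experiment with, here is a simplified restatement of it as a standalone function. It is a sketch, not the base R source: matrix_row_names and the example data frame named are names invented for illustration, and only the row-name decision is reproduced, not the rest of as.matrix.data.frame.

```r
# Hypothetical helper mirroring only the row-name branch quoted above
matrix_row_names <- function(df, rownames.force = NA) {
  if (rownames.force %in% FALSE) {
    NULL                      # caller explicitly refuses row names
  } else if (rownames.force %in% TRUE) {
    row.names(df)             # caller explicitly asks for row names
  } else if (.row_names_info(df) <= 0L) {
    NULL                      # default: automatic row names are dropped
  } else {
    row.names(df)             # default: explicit row names are kept
  }
}

x <- data.frame(x = 1:3, y = 11:13)
named <- data.frame(x = 1:3, y = 11:13, row.names = c("a", "b", "c"))

matrix_row_names(x)                               # NULL
matrix_row_names(x, rownames.force = TRUE)        # "1" "2" "3"
matrix_row_names(named)                           # "a" "b" "c"
matrix_row_names(named, rownames.force = FALSE)   # NULL
```

With the default rownames.force = NA, the outcome depends only on the sign reported by .row_names_info, which is why two data frames that print identically can yield different matrices.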

Then why does y have different row names? Let's look at [.data.frame; I have added comments at the key lines:

{
    ... # many lines of code
    xx <- x  #!! this is where xx is defined
    cols <- names(xx)
    x <- vector("list", length(x))
    x <- .Internal(copyDFattr(xx, x))  # This is where I am not sure about
    oldClass(x) <- attr(x, "row.names") <- NULL
    if (has.j) {
        nm <- names(x)
        if (is.null(nm)) 
            nm <- character()
        if (!is.character(j) && anyNA(nm)) 
            names(nm) <- names(x) <- seq_along(x)
        x <- x[j]
        cols <- names(x)
        if (drop && length(x) == 1L) {
            if (is.character(i)) {
                rows <- attr(xx, "row.names")
                i <- pmatch(i, rows, duplicates.ok = TRUE)
            }
            xj <- .subset2(.subset(xx, j), 1L)
            return(if (length(dim(xj)) != 2L) xj[i] else xj[i, 
                                                            , drop = FALSE])
        }
        if (anyNA(cols)) 
            stop("undefined columns selected")
        if (!is.null(names(nm))) 
            cols <- names(x) <- nm[cols]
        nxx <- structure(seq_along(xx), names = names(xx))
        sxx <- match(nxx[j], seq_along(xx))
    }
    else sxx <- seq_along(x)
    rows <- NULL ## this is where rows is defined, as we give numeric i, the following
    ## if block will not be executed
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L) 
            xj[i]
        else xj[i, , drop = FALSE]
    }
    if (drop) {
        n <- length(x)
        if (n == 1L) 
            return(x[[1L]])
        if (n > 1L) {
            xj <- x[[1L]]
            nrow <- if (length(dim(xj)) == 2L) 
                dim(xj)[1L]
            else length(xj)
            drop <- !mdrop && nrow == 1L
        }
        else drop <- FALSE
    }
    if (!drop) { ## drop is False for our case
        if (is.null(rows)) 
            rows <- attr(xx, "row.names")  ## rows changed from NULL to 1,2,3 here
        rows <- rows[i]
        if ((ina <- anyNA(rows)) | (dup <- anyDuplicated(rows))) {
            if (!dup && is.character(rows)) 
                dup <- "NA" %in% rows
            if (ina) 
                rows[is.na(rows)] <- "NA"
            if (dup) 
                rows <- make.unique(as.character(rows))
        }
        if (has.j && anyDuplicated(nm <- names(x))) 
            names(x) <- make.unique(nm)
        if (is.null(rows)) 
            rows <- attr(xx, "row.names")[i]
        attr(x, "row.names") <- rows  ## this is where the rownames of x changed
        oldClass(x) <- oldClass(xx)
    }
    x
}

We can see that y gets its row names from something like attr(x, 'row.names'):

> attr(x, 'row.names')
[1] 1 2 3

So when we create y with [.data.frame, it receives a row.names attribute that differs from that of x, whose row names are automatic and are marked by a negative sign in the dput output.


edit

Actually, this is stated in the manual page for row.names:

Note

row.names is similar to rownames for arrays, and it has a method that calls rownames for an array argument.

Row names of the form 1:n for n > 2 are stored internally in a compact form, which might be seen from C code or by deparsing but never via row.names or attr(x, "row.names"). Additionally, some names of this sort are marked as ‘automatic’ and handled differently by as.matrix and data.matrix (and potentially other functions).

So attr doesn't discriminate between automatic row.names (like those of x) and explicit integer row.names (like those of y), whereas as.matrix does, through the internal representation reported by .row_names_info.
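A small check of this point, as a sketch: z is an illustrative copy of x whose row names are set explicitly to 1:3 through attr<-, the same kind of assignment that [.data.frame performs at the end of the code above.

```r
x <- data.frame(x = 1:3, y = 11:13)   # automatic row names
z <- x
attr(z, "row.names") <- 1:3           # explicit integer row names of the form 1:n

# Neither attr() nor row.names() reveals any difference
identical(attr(x, "row.names"), attr(z, "row.names"))
# [1] TRUE
identical(row.names(x), row.names(z))
# [1] TRUE

# .row_names_info() does: the sign distinguishes automatic from explicit
c(x = .row_names_info(x), z = .row_names_info(z))
#  x  z
# -3  3
```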

mt1022
  • Something worth noting is that `attr(x, "row.names")` and `attr(x, "row.names") = value` do not show how R, internally, handles the specific case of "row.names". `.row_names_info` is more accurate. E.g. `attr(x, "row.names") = 1:3` does not store `1:3` as "row.names" but as shown in `.row_names_info(x, 0)`. Still, though, anything other than `NULL` flags the object as having user-defined "row.names", hence functions (like `as.matrix`) need to/should take this into account. – alexis_laz Apr 04 '17 at 16:05
  • Sure. `attr(x, 'row.names')` and `attr(y, 'row.names')` give the same results! – mt1022 Apr 04 '17 at 16:09