
The following operation works fine:

df1 <- data.frame(a = 1, b = 2)
cbind(df1, c = 3:4)
#>   a b c
#> 1 1 2 3
#> 2 1 2 4

However, if I subset df1, even in a way that keeps it identical, I get a warning:

df2 <- df1[1,]
identical(df1, df2)
#> [1] TRUE
cbind(df2, c=3:4)
#>   a b c
#> 1 1 2 3
#> 2 1 2 4

Warning in data.frame(..., check.names = FALSE): row names were found from a short variable and have been discarded

I haven't set any row names, and the two data frames are supposed to be identical. What is happening?

moodymudskipper
    If you check the row names, they differ: the first has `row.names = c(NA, -1L)` and the second has `row.names = 1L`. – akrun Feb 08 '20 at 23:52

1 Answer


By default, identical() doesn't always tell the full story:

identical(df1, df2, attrib.as.set = FALSE)
#> [1] FALSE

This option compares the attributes more rigorously. Looking at the attributes themselves, though, the only visible difference is their order, and that is not the cause of the observed behavior, as we'll see.

attributes(df1)
#> $names
#> [1] "a" "b"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1
attributes(df2)
#> $names
#> [1] "a" "b"
#> 
#> $row.names
#> [1] 1
#> 
#> $class
#> [1] "data.frame"
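As a small sketch of why the printed attributes look the same: attr() returns row names in expanded form, so the two objects cannot be told apart this way. The example reuses df1 and df2 from the question and only base R functions.

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- df1[1, ]

# Same set of attribute names, even if the order differs
setequal(names(attributes(df1)), names(attributes(df2)))
#> [1] TRUE

# attr() expands the stored row names, so both look like the integer 1
identical(attr(df1, "row.names"), attr(df2, "row.names"))
#> [1] TRUE
```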

We can try row.names(), but it won't help. More information can be displayed using .row_names_info() and dput():

row.names(df1) # sneaky snake!
#> [1] "1"
row.names(df2)
#> [1] "1"
.row_names_info(df1)
#> [1] -1
.row_names_info(df2)
#> [1] 1

dput(df1)
#> structure(list(a = 1, b = 2), class = "data.frame", row.names = c(NA, 
#> -1L))
dput(df2)
#> structure(list(a = 1, b = 2), row.names = 1L, class = "data.frame")

In fact, cbind.data.frame() calls data.frame(), which itself calls .row_names_info() and tests its sign before triggering the warning. .row_names_info(df1) is negative, while .row_names_info(df2) is positive.
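To make the sign convention more concrete, here is a minimal sketch using the type argument of .row_names_info(), as I understand it from its documentation: type = 0L returns the stored attribute, type = 1L (the default) returns the row count with a negative sign for automatic row names, and type = 2L returns the plain row count.

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- df1[1, ]

# Stored form of the row names
.row_names_info(df1, type = 0L)
#> [1] NA -1
.row_names_info(df2, type = 0L)
#> [1] 1

# Row count, negated when row names are "automatic"
.row_names_info(df1, type = 1L)
#> [1] -1
.row_names_info(df2, type = 1L)
#> [1] 1

# Plain row count, the same for both
.row_names_info(df1, type = 2L)
#> [1] 1
.row_names_info(df2, type = 2L)
#> [1] 1
```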

Setting the row names to NULL "reinitializes" them:

row.names(df2) <- NULL
cbind(df2, c=3:4)
#>   a b c
#> 1 1 2 3
#> 2 1 2 4

So, in essence, the warning was saying that we were trying to recycle the rows of a data.frame that had row names, so the row names had to be discarded for the recycling to happen. If the row names are truly nonexistent (automatic), the recycling happens silently.
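The three cases can be checked programmatically with a small sketch. The warns() helper is hypothetical (it is not part of base R) and simply reports whether evaluating an expression raises a warning.

```r
# Hypothetical helper: TRUE if evaluating `expr` raises a warning
warns <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

df1 <- data.frame(a = 1, b = 2)
df2 <- df1[1, ]

fresh_warns  <- warns(cbind(df1, c = 3:4))  # automatic row names
subset_warns <- warns(cbind(df2, c = 3:4))  # row names stored as 1L

row.names(df2) <- NULL                      # back to automatic row names
reset_warns  <- warns(cbind(df2, c = 3:4))

c(fresh = fresh_warns, subset = subset_warns, reset = reset_warns)
```

Given the behavior shown above, only the subset case should come out as TRUE.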

Now we can argue about the definition and relevance of "having no row names" here.

I'm aware that this answer doesn't answer everything (not all the how, and not any of the why), but that's all I have!

moodymudskipper