The Problem
This is a simple tapply example:
z=data.frame(s=as.character(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
On R (Another Canoe) v3.3.3 (2017-03-06) for Windows, it returns:
#    1  2
# 1 NA NA
# 2 NA NA
On R (You Stupid Darkness) v3.4.0 (2017-04-21) for Windows, it returns:
#    1  2
# 1 NA NA
# 2 NA ""
R News References
According to NEWS.R-3.4.0:
"tapply() gets new option default = NA allowing to change the previously hardcoded value."
In this instance, however, it seems to default to an empty string instead.
Inconsistencies Among Data Types
The new behavior is inconsistent with the numeric or logical version, where one still gets all NAs:
z=data.frame(s=as.numeric(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
#    1  2
# 1 NA NA
# 2 NA NA
The same holds for s=NA, which means s=as.logical(NA).
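The logical case can be checked directly. The following is a minimal sketch reusing the same rows and cols as above; according to the behavior described here, every cell should come out as a logical NA:

```r
# Same layout as before, but s is a plain (logical) NA
z <- data.frame(s = NA, rows = c(1, 2, 1), cols = c(1, 1, 2))
m <- tapply(z$s, list(z$rows, z$cols), identity)

typeof(m)       # storage type of the result
all(is.na(m))   # TRUE when every cell, including the empty (2,2), is NA
```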
An Even Worse Case
In a more realistic context, the character vector s in z has several values, including NAs.
z=data.frame(s=c('a', NA, 'c'), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
#      s rows cols
# 1    a    1    1
# 2 <NA>    2    1
# 3    c    1    2
#   1   2
# 1 "a" "c"
# 2 NA  ""
In general, we might fix this by replacing the empty strings, which mark combinations with no values, with missing values:
m[!nzchar(m)]=NA; m
#   1   2
# 1 "a" "c"
# 2 NA  NA
Now when there is no value, such as in (2,2), one correctly gets an NA, as in the old versions.
But what if the input of tapply already has some empty strings?
z=data.frame(s=c('a', NA, ''), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
#      s rows cols
# 1    a    1    1
# 2 <NA>    2    1
# 3         1    2
#   1   2
# 1 "a" ""
# 2 NA  ""
From the result alone, there is now no way to distinguish between the legitimate empty string in (1,2) and the one artificially added in (2,2) by the new tapply in place of the NA. So we can't apply the fix.
Questions
Is the new behavior really the correct one?
That is, if there is no string for rows=2 and cols=2, why is this not reported as a missing value (NA), and why does this happen only for character data types?
Can we rewrite the code above so as to get consistent behavior across R versions?
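One possible direction for the last question, offered only as a sketch: rather than guessing from the contents of the result which cells were empty, count the observations per cell with table() and overwrite only the cells whose count is zero. This assumes that table() and tapply() order the row and column levels identically (both convert the grouping vectors to factors). The names counts and empty are purely illustrative.

```r
z <- data.frame(s = c("a", NA, ""), rows = c(1, 2, 1), cols = c(1, 1, 2),
                stringsAsFactors = FALSE)
m <- tapply(z$s, list(z$rows, z$cols), identity)

# Number of observations in each (rows, cols) cell. NA values of s still
# count, because only the grouping vectors are tabulated.
counts <- table(z$rows, z$cols)

# Cells with no observation at all, in the same column-major order as m
empty <- as.vector(counts) == 0L

# Force those cells to NA, whatever tapply() filled them with
m[empty] <- NA
m
```

The intent is that the genuine empty string in (1,2) is left untouched while (2,2) becomes NA. This has not been checked here against every R version.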