The Problem

This is a simple tapply example:

z=data.frame(s=as.character(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity) 

On R (Another Canoe) v3.3.3 (2017-03-06) for Windows, it returns:

#   1  2 
# 1 NA NA
# 2 NA NA

On R (You Stupid Darkness) v3.4.0 (2017-04-21) for Windows, it returns:

#   1  2 
# 1 NA NA
# 2 NA ""

R News References

According to NEWS.R-3.4.0:

tapply() gets new option default = NA allowing to change the previously hardcoded value.

In this instance, however, it seems to default to an empty string.

Inconsistencies Among Data Types

The new behavior is inconsistent with the numeric or logical version, where one still gets all NAs:

z=data.frame(s=as.numeric(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)

#    1  2
# 1 NA NA
# 2 NA NA

The same holds for s=NA, which is equivalent to s=as.logical(NA).
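The three cases can be put side by side. This is only a minimal sketch: `idx`, `types` and `res` are names introduced here for illustration, and no output is shown because the result for the character case depends on the R version.

```r
# Same grouping as in the examples above: cell (2,2) has no observations
idx <- list(c(1, 2, 1), c(1, 1, 2))

# One all-NA input vector per data type
types <- list(character = NA_character_, numeric = NA_real_, logical = NA)
res <- lapply(types, function(na) tapply(rep(na, 3), idx, identity))

# TRUE means the empty cell (2,2) came back as NA for that type
sapply(res, function(m) is.na(m[2, 2]))
```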

An Even Worse Case

In a more realistic context, the character vector s in z has several values, including NAs.

z=data.frame(s=c('a', NA, 'c'), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m

#      s rows cols
# 1    a    1    1
# 2 <NA>    2    1
# 3    c    1    2

#   1   2  
# 1 "a" "c"
# 2 NA  "" 

In general, we might fix this by assigning NA to the combinations that have no values:

m[!nzchar(m)]=NA; m
#   1   2  
# 1 "a" "c"
# 2 NA  NA 

Now, when there is no value, as in (2,2), one correctly gets an NA, as in the old versions. But what if the input of tapply already contains some empty strings?

z=data.frame(s=c('a', NA, ''), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m

#      s rows cols
# 1    a    1    1
# 2 <NA>    2    1
# 3         1    2

#   1   2 
# 1 "a" ""
# 2 NA  ""

Now there is no way to distinguish between the legitimate empty string in (1,2) and the one artificially added in (2,2) in place of the NA by the new tapply. So we can't apply the fix.

Questions

Is the new behavior really the correct one? That is, if there is no string for rows=2 and cols=2, why is this not reported as a missing value (NA), and why does this happen only for character data?

Can we rewrite the code above so that it behaves consistently across R versions?
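As a possible starting point for the second question, the empty combinations could be detected from the grouping variables themselves rather than from the content of the result. The sketch below is an assumption about what might work, not a confirmed fix: it relies on `table()` producing its cells in the same level order that `tapply()` uses, and `counts` is a name introduced here for illustration.

```r
z <- data.frame(s = c("a", NA, ""), rows = c(1, 2, 1), cols = c(1, 1, 2),
                stringsAsFactors = FALSE)
m <- tapply(z$s, list(z$rows, z$cols), identity)

# Count the observations in each rows/cols combination
counts <- table(z$rows, z$cols)

# Cells with zero observations are set to NA, whatever tapply put there;
# the genuine empty string in (1,2) is left untouched
m[as.vector(counts == 0)] <- NA
m
```

Here the NA in (2,2) is derived from the counts, so the empty string in (1,2) is no longer ambiguous.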

antonio
  • If you change it to `z <- data.frame(s = NA, ...` it would give NA – akrun May 09 '17 at 12:04
  • @beetroot: right you are. I updated the question – antonio May 09 '17 at 12:15
  • @akrun: `s=as.character(NA)` is just intended to help understanding where the problem is, i.e. in character NAs. In practice, it would be something like `s=c('a', NA, 'c')`. – antonio May 09 '17 at 12:24
  • It seems that the issue can be isolated to the fact that `array(character(), c(1, 1))` (something similar is used to initialize the returned value inside `tapply`) does not return `NA_character_` but `""` _in contrast_ to the "Value" section of `?array`. – alexis_laz May 09 '17 at 13:32
