I wrote a small function that counts the number of NA, NaN and Inf values in a tibble data frame, as follows:

check.for.missing.values <- function(df) {
     return(  sum(is.na(as.matrix(df)) & !is.nan(as.matrix(df))) +    #NAs
              sum(is.infinite(as.matrix(df))) +                       #Infs
              sum(is.nan(as.matrix(df)))                              #NaNs
)}

I tested it with the following tibble:

x1 <- tibble(x = 1:7, 
             y = c(NA,NA,Inf,Inf,Inf,-Inf,-Inf), 
             z = c(-Inf,-Inf,NaN,NaN,NaN,NaN,NaN))
x1
# A tibble: 7 × 3
      x     y     z
  <int> <dbl> <dbl>
1     1    NA  -Inf
2     2    NA  -Inf
3     3   Inf   NaN
4     4   Inf   NaN
5     5   Inf   NaN
6     6  -Inf   NaN
7     7  -Inf   NaN

And I get

check.for.missing.values(x1)
[1] 14

which of course is the correct answer.

Now, if the tibble that I pass to the function happens to include a column in date format, the function stops working and I can't figure out why:

x2 <- mutate(x1, date = as.Date('01/07/2008','%d/%m/%Y'))
x2

# A tibble: 7 × 4
      x     y     z       date
  <int> <dbl> <dbl>     <date>
1     1    NA  -Inf 2008-07-01
2     2    NA  -Inf 2008-07-01
3     3   Inf   NaN 2008-07-01
4     4   Inf   NaN 2008-07-01
5     5   Inf   NaN 2008-07-01
6     6  -Inf   NaN 2008-07-01
7     7  -Inf   NaN 2008-07-01

check.for.missing.values(x2)
[1] 7

Any clues as to what's going on?

Thanks

reyemarr

Mario Reyes

1 Answer

As @nicola mentions, your issue is that you're converting the data frame to a matrix. A matrix can only hold one type, so the conversion coerces every "cell" to a single class, which here ends up being "character" because of the date column. Once the values are strings, is.infinite() and is.nan() return FALSE for every cell, so your Inf and -Inf values are no longer caught by your function.
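
To see the coercion directly, here is a minimal sketch using a base-R data frame as a stand-in for the tibble in the question (the names `demo`, `m_num` and `m_chr` are made up for illustration):

```r
# Small base-R stand-in for the question's tibble (hypothetical names).
demo <- data.frame(
  y    = c(NA, Inf, -Inf),
  z    = c(-Inf, NaN, NaN),
  date = as.Date("2008-07-01")   # recycled to all three rows
)

# Numeric columns only: the matrix stays numeric and Inf is detected.
m_num <- as.matrix(demo[c("y", "z")])
typeof(m_num)            # "double"
sum(is.infinite(m_num))  # 3

# With the Date column included, every cell becomes a string.
m_chr <- as.matrix(demo)
typeof(m_chr)            # "character"
sum(is.infinite(m_chr))  # 0
sum(is.nan(m_chr))       # 0
```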

You can do what you're trying to do without resorting to the matrix conversion, by applying over the columns in the data frame. In your case, sapply will work.

check.for.missing.values <- function(df) {
    # Count matches column by column so each column keeps its own type
    sum( sapply( df, function(x) {
        sum( ( is.na(x) & !is.nan(x) ) |    # NAs
                 is.infinite(x) |           # Infs
                 is.nan(x) )                # NaNs
    } ) )
}

sapply iterates over every column, counting the values in that column that match any of the given conditions. That returns a numeric vector with one count per column, which can then be summed again to get the total.
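
If it helps to see the intermediate per-column vector, here is a rough sketch on a small base-R data frame (`demo` and `per_column` are illustrative names, not from the question). It relies on is.na() being TRUE for NaN as well as NA, so the three conditions collapse to two:

```r
# Small base-R stand-in for the question's tibble (hypothetical names).
demo <- data.frame(
  x    = 1:3,
  y    = c(NA, Inf, -Inf),
  z    = c(-Inf, NaN, NaN),
  date = as.Date("2008-07-01")
)

# One count per column, each computed on that column's own type.
# is.na() already covers NaN, so NA/NaN/Inf reduces to two checks.
per_column <- sapply(demo, function(col) sum(is.na(col) | is.infinite(col)))
per_column       # named vector: x = 0, y = 3, z = 3, date = 0

sum(per_column)  # 6
```

Printing `per_column` before summing is also a quick way to find out which columns contain the problem values.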

check.for.missing.values(x2)
[1] 14
rosscova
  • Or just `check.for.missing.values <- function(df) { x <- unlist(df) ; sum(c((is.na(x) & !is.nan(x)), is.infinite(x), is.nan(x))) }`. – ulfelder May 09 '17 at 12:21
  • @ulfelder interesting, I thought `unlist` would coerce to `character` just like `as.matrix`, but it goes to `numeric` instead. What's the difference between `as.matrix` and `unlist` causing that difference? – rosscova May 09 '17 at 12:24
  • Yeah, actually, I just tested my version on a tibble with strings, and then you're back to your original problem. I think it worked with the dates because they could be coerced to numeric, but otherwise `unlist` is going to present problems similar to `as.matrix`. So `sapply` is a more robust solution. – ulfelder May 09 '17 at 12:34
  • True. Still interested in why `unlist` and `as.matrix` coerce the same data frame to different classes though. – rosscova May 09 '17 at 12:36
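
The difference discussed in these comments can be checked with a short sketch. It assumes small base-R data frames with made-up names (`with_date`, `with_text`), and it only shows the resulting types; it is not a full account of the coercion rules:

```r
# A Date column is stored internally as a number of days, so unlist()
# can combine it with the numeric column without going to character.
with_date <- data.frame(y = c(NA, Inf), date = as.Date("2008-07-01"))
typeof(unlist(with_date))     # "double"
typeof(as.matrix(with_date))  # "character"

# A genuine character column forces unlist() to character as well.
with_text <- data.frame(y = c(NA, Inf), label = c("a", "b"),
                        stringsAsFactors = FALSE)
typeof(unlist(with_text))     # "character"
```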