Viewing all column names with any NA in R

Question

I need to get the name of the columns that have at least 1 NA.

df<-data.frame(a=1:3,b=c(NA,8,6), c=c('t',NA,7))

I need to get "b, c".

I found this code:

sapply(df, function(x) any(is.na(x)))

But I need only the variables that have any NA.

I tried this:

sapply(df, function(x) colnames(df[,any(is.na(x))]))

But I get all the column names.

score 10 · Answer 1 · edited Jan 14 '17 at 19:17

10

You were very close. Your first try yields a boolean vector, which you can use to index the names of df:

contains_any_na = sapply(df, function(x) any(is.na(x)))
names(df)[contains_any_na]
# [1] "b" "c"

Update Jan 14, 2017: As of R version 3.1.0, anyNA() can be used as an alternative to any(is.na(.)), and the above code can be simplified to

names(df)[sapply(df, anyNA)]
# [1] "b" "c"

edited Jan 14 '17 at 19:17

Rich Scriven

97,041
11
181
245

answered Sep 28 '14 at 13:37

Paul Hiemstra

59,984
12
142
149

agstudy · Accepted Answer · 2014-09-28T13:45:20.817

9

Another acrobatic solution (just for fun) :

colnames(df)[!complete.cases(t(df))]
[1] "b" "c"

The idea is : Getting the columns of A that have at least 1 NA is equivalent to get the rows that have at least NA for t(A). complete.cases by definition (very efficient since it is just a call to C function) gives the rows without any missing value.

edited Sep 28 '14 at 13:45

answered Sep 28 '14 at 13:38

agstudy

119,832
17
199
261

@Arun of course..see my edit(performance explanation) in akrun answer. – agstudy Oct 01 '14 at 14:47
1

@Arun because the performance is done by akrun and I begin my answer by saying "acrobatic solution" , The OP accpted it I don't know why (reputation baisis) :) but feel free to edit it! – agstudy Oct 01 '14 at 14:53

score 6 · Answer 3 · edited Oct 01 '14 at 14:37

6

Try the data.table version:

library(data.table)
setDT(df)
names(df)[df[,sapply(.SD, function(x) any(is.na(x))),]]
[1] "b" "c"

Microbenchmarking using @akrun's code:

set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
setDT(df1)


f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
           names(df1)[contains_any_na]}

f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
f3 <- function() { names(df1)[!!colSums(is.na(df1))] }

f4 <- function() { names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]] }

microbenchmark(f1(), f2(), f3(), f4(), unit="relative")   
# Unit: relative
#  expr       min        lq    median       uq      max neval
#  f1()  1.000000  1.000000  1.000000 1.000000 1.000000   100
#  f2() 10.459124 10.928821 10.955986 9.858967 7.069066   100
#  f3()  3.323144  3.805183  4.159624 3.775549 2.797329   100
#  f4() 10.108998 10.242207 10.121022 9.117067 6.576976   100

@agstudy : This solution is similar in speed to colnames(df1)[!complete.cases(t(df1))].

edited Oct 01 '14 at 14:37

Arun

116,683
26
284
387

answered Sep 28 '14 at 14:59

rnso

23,686
25
112
234

this is will be the slowest solution :) – agstudy Sep 28 '14 at 15:01
@rnso I was just joking, my remark was just to say that not because you use data.table you will have the fastest solution. +1 for benchmarking for you and akrun . – agstudy Sep 28 '14 at 17:32
1

The time taken here is due to `.SD` being a *deep* copy of `df1`. This could be avoided to improve speed in operations without `by`. See [#838](https://github.com/Rdatatable/data.table/issues/838). – Arun Oct 01 '14 at 14:47
1

The benchmarking makes this a phenomenal answer. – Dale Kube Aug 31 '21 at 13:49

akrun · Answer 4 · 2014-09-28T16:38:15.633

5

 names(df)[!!colSums(is.na(df))]
 #[1] "b" "c"

Explanation

colSums(is.na(df)) #gives you the number of missing value per each columns
#a b c 
#0 1 1

By using !, we are creating a logical index

!colSums(is.na(df))   #here the value of `0` will be `TRUE` and all other values `>0` FALSE
 #   a     b     c 
 #TRUE FALSE FALSE

But, we need to select those columns that have atleast one NA, so ! negate again

!!colSums(is.na(df))
#   a     b     c 
#FALSE  TRUE  TRUE

and use this logical index to get the colnames that have at least one NA

Benchmarks

 set.seed(49)
 df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))

 library(microbenchmark)

 f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
            names(df1)[contains_any_na]}

 f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
 f3 <- function() { names(df1)[!!colSums(is.na(df1))] }

 microbenchmark(f1(), f2(), f3(), unit="relative")
 #Unit: relative
 #expr      min       lq   median       uq      max neval
 #f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
 #f2() 8.921109 7.289053 6.852122 6.210826 4.889684   100
 #f3() 3.248072 3.105798 2.984453 2.774513 2.599745   100

EDIT performance explanation:

Maybe surprising sapply based solution is the winner here because as noted in @flodel comment below , the 2 others solutions created a matrix behind the scene (t(df) and is.na(df)) create matrix.

edited Sep 28 '14 at 16:38

answered Sep 28 '14 at 13:32

akrun

874,273
37
540
662

Could you provide some more detail why your solution works, for example that taking the sum of a logical vector yields the number of TRUE values, that `!` means `NOT`. This would make the answer more useful for the OP and other people. – Paul Hiemstra Sep 28 '14 at 13:38
@Paul Hiemstra Thanks for the comment. I added some explanation. – akrun Sep 28 '14 at 13:43
Thanks but why I can't do it with names(colSums(is.na(df))>0)? – GabyLP Sep 28 '14 at 13:45
but I get:> names(colSums(is.na(df))>0) [1] "a" "b" "c" – GabyLP Sep 28 '14 at 13:49
@GabyP I would say even taht : `!!` is more elegant but `>0` is more readable. – agstudy Sep 28 '14 at 13:49
@akrun amazing benchmarking... I expected that `colSums` is faster than my transpose based solution , but having `sapply` as the fastest this is a real surprise! – agstudy Sep 28 '14 at 14:42
@akrun I guess that `sapply` is the way to go when you want to perform a solution over columns. – agstudy Sep 28 '14 at 14:47
1

Your two solutions are slower because `t(df)` and `is.na(df)` create matrices. – flodel Sep 28 '14 at 14:52

score 0 · Answer 5 · answered Jan 14 '17 at 19:07

A simple one liner for this is :

colnames(df[,sapply(df, function(x) any(is.na(x)))])

Explanation:

sapply(df, function(x) any(is.na(x)))

returns True/False for columns with atleast 1 NA. df[,sapply(df, function(x) any(is.na(x)))] gets the subset of dataframe that has all its columns with atleast 1 NA. And colnames gives the names of those columns.

Viewing all column names with any NA in R

5 Answers5

Explanation

Benchmarks

EDIT performance explanation:

Linked