0

I'm new to writing functions and am sure this is a simple one. I have a data frame of 111 columns by ~10,500 rows, with all missing values coded as <NA>. Intuitively, I need a function that does the following column-wise over the data frame:

ifelse(sum(is.na(colx)) > length(colx)/5, NULL, colx)

i.e. I need to drop any variables with more than 1/5 (20%) missing values. Thanks to all for indicating there's a similar answer, i.e. using

colMeans(is.na(mydf)) > .20

to ID the columns, but this doesn't fully answer my question.

The above code returns a logical vector indicating the variables to be dropped. I have more than 100 variables with complex names and picking through them to drop by hand is tedious and bound to introduce errors. How can I modify the above, or use some version of my original proposed ifelse, to only return a new dataframe of columns with < 20% NA, as I asked originally?

Thanks!!

jlev514
  • 7
    Check out http://stackoverflow.com/questions/11821303/deleting-columns-from-a-data-frame-where-na-is-more-than-15-of-the-column-lengt – lukeA Oct 05 '15 at 09:24
  • This should work: `mydf <- mydf[ , colMeans(is.na(mydf)) <= 0.2 ]`. You can supply a logical vector to choose which columns to keep (TRUE) and which to drop (FALSE). – zx8754 Oct 05 '15 at 10:03
  • 1
    @zx8754 thanks for this, works great! Also did find similar answer in the duplicate post: `final <- mydf[, colMeans(is.na(mydf)) <= .20]` – jlev514 Oct 05 '15 at 10:09
  • @jlev514 you can close as duplicate your own post. – zx8754 Oct 05 '15 at 10:09
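A minimal sketch of the approach from the comments, run on a small made-up data frame (the name toy and its columns are invented here, not taken from the question). colMeans(is.na(...)) gives the fraction of NA values per column, and comparing that with 0.2 gives the logical vector used to pick the columns to keep:

```r
# Toy data: 10 rows; column names and values are invented for illustration.
toy <- data.frame(
  a = 1:10,                  # 0% NA
  b = c(NA, 2:10),           # 10% NA
  c = c(NA, NA, NA, 4:10),   # 30% NA
  d = c(rep(NA, 6), 7:10)    # 60% NA
)

na_frac <- colMeans(is.na(toy))  # fraction of NA values per column
keep    <- na_frac <= 0.2        # TRUE for the columns to keep

# drop = FALSE keeps the result a data frame even if only one column survives
final <- toy[, keep, drop = FALSE]
names(final)
```

Because the selection is done with the logical vector itself, none of the column names has to be typed by hand.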

2 Answers

1

One way of doing this (probably not the shortest) is to iterate over the rows of the data.frame with by and then rbind the results together into one data.frame.

Just change the condition in the if in the code below; here, rows with at least one NA value are removed.

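# Split the data frame into one-row pieces, keep only the rows without NA,
# then stack the surviving rows back into a single data frame.
do.call(rbind, by(your.dataset,
                  seq_len(nrow(your.dataset)),
                  FUN = function(x) {
                    if (sum(is.na(x)) == 0) {
                      x      # no NA in this row: keep it
                    } else {
                      NULL   # at least one NA: drop the row
                    }
                  }))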
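For the particular condition used above (keep a row only if it contains no NA), base R's complete.cases offers a shorter route. This is a different technique from the by/rbind loop, sketched here on an invented data frame named toy:

```r
# Toy data: names and values are invented for illustration.
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# complete.cases() returns TRUE for rows that contain no NA
complete_rows <- toy[complete.cases(toy), , drop = FALSE]

# The same idea with a threshold: keep rows with at most 20% NA
# (with only two columns, that again means rows without any NA)
few_na_rows <- toy[rowMeans(is.na(toy)) <= 0.2, , drop = FALSE]
```

Both versions work on the whole data frame at once, so they avoid splitting it into one-row pieces.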
snaut
0

When you use lapply on a data.frame, it applies the given function to each column, because a data.frame is a list of columns.

So if f is your function for "processing" a column, you should use:

lapply(df, f)

vapply should be used when the result for each column will always be a vector of a known type and length.

sapply is like an automatic vapply. It tries to simplify the result to a vector. I would advise against using sapply, except for exploratory programming.
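To make the difference between the three concrete, here is a small sketch on an invented data frame (toy and the helper count_na are names made up for this example):

```r
# Toy data: names and values are invented for illustration.
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, NA, "x", "y")
)

# Hypothetical helper: number of NA values in one column
count_na <- function(col) sum(is.na(col))

as_list <- lapply(toy, count_na)              # a list with one element per column
checked <- vapply(toy, count_na, integer(1))  # named integer vector; errors if a result is not a single integer
guessed <- sapply(toy, count_na)              # same vector here, but the shape is inferred from the results
```

vapply states the expected result up front, so a function that misbehaves fails loudly instead of silently changing the shape of the output.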


(Updated to reflect edit)

Try:

f <- function(x) {
    sum(is.na(x)) < length(x) * 0.2
}

df[, vapply(df, f, logical(1)), drop = FALSE]
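A quick way to check this on a small example (the data frame toy is invented for illustration; f is the function from above, repeated so the snippet runs on its own):

```r
f <- function(x) {
  sum(is.na(x)) < length(x) * 0.2
}

# Toy data: 10 rows; names and values are invented for illustration.
toy <- data.frame(
  a = 1:10,                 # 0 NA
  b = c(NA, 2:10),          # 1 NA out of 10
  c = c(NA, NA, NA, 4:10)   # 3 NA out of 10
)

keep <- vapply(toy, f, logical(1))  # one TRUE/FALSE per column
kept <- toy[, keep, drop = FALSE]   # columns a and b survive

# Without drop = FALSE, a single surviving column collapses to a plain vector
one_col <- toy[, c(TRUE, FALSE, FALSE)]                # integer vector
one_df  <- toy[, c(TRUE, FALSE, FALSE), drop = FALSE]  # still a data frame
```

The drop = FALSE argument matters only when exactly one column passes the test, but leaving it in keeps the result type predictable.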
sdgfsdh
  • The question has been updated now, but the questioner was asking for a way to bind `x` in `function(x)` to a column. `lapply` does exactly that. – sdgfsdh Oct 05 '15 at 09:56