R function to remove variables with more than 20% missing values (NOT just ID them)?

Question

I'm new to writing functions, and I'm sure this is a simple one. I have a data frame of 111 columns by ~10,500 rows, with all missing values coded as <NA>. Intuitively, I need a function that does the following column-wise over the data frame:

ifelse(sum(is.na(colx)) > length(colx)/5, NULL, colx)

i.e. I need to drop any variables with more than 1/5 (20%) missing values. Thanks to all for indicating there's a similar answer, i.e. using

colMeans(is.na(mydf)) > .20

to ID the columns, but this doesn't fully answer my question.

The above code returns a logical vector indicating the variables to be dropped. I have more than 100 variables with complex names and picking through them to drop by hand is tedious and bound to introduce errors. How can I modify the above, or use some version of my original proposed ifelse, to only return a new dataframe of columns with < 20% NA, as I asked originally?

Thanks!!

Check out http://stackoverflow.com/questions/11821303/deleting-columns-from-a-data-frame-where-na-is-more-than-15-of-the-column-lengt — lukeA, Oct 05 '15 at 09:24
This should work: `mydf <- mydf[ , colMeans(is.na(mydf)) <= 0.2 ]`. You can supply a logical vector to choose which columns to keep (TRUE) and which to drop (FALSE). — zx8754, Oct 05 '15 at 10:03
@zx8754 thanks for this, works great! Also did find similar answer in the duplicate post: `final <- mydf[, colMeans(is.na(mydf)) <= .20]` — jlev514, Oct 05 '15 at 10:09
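A minimal sketch of how the one-liner from these comments could be wrapped in a reusable function. The name `drop_sparse_cols`, the `threshold` argument, and the toy data are made up for illustration and are not from the thread.

```r
# Hypothetical helper: keep only columns whose share of NA is at most `threshold`.
drop_sparse_cols <- function(df, threshold = 0.2) {
  na_share <- colMeans(is.na(df))            # fraction of NA values per column
  df[, na_share <= threshold, drop = FALSE]  # drop = FALSE always returns a data.frame
}

# Toy data (illustrative only): 10 rows, so each NA is worth 10%.
toy <- data.frame(
  a = 1:10,                         # 0% NA
  b = c(NA, 2:10),                  # 10% NA, kept
  c = c(NA, NA, NA, 4:10),          # 30% NA, dropped
  d = c(rep(NA, 6), letters[1:4])   # 60% NA, dropped
)

kept <- drop_sparse_cols(toy)
names(kept)
```

Because the result is a new data frame, the original `toy` is left untouched, and no column names need to be typed by hand.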

score 1 · Answer 1 · answered Oct 05 '15 at 10:04

One way of doing this (probably not the shortest) is to iterate over the rows of the data.frame with by and then rbind the results back together into one data.frame.

Just change the condition in the if in the code below; as written, rows with at least one NA value are removed.

do.call(rbind, by(your.dataset,
                  seq_len(nrow(your.dataset)),
                  FUN = function(x) {
                    if (sum(is.na(x)) == 0) {
                      return(x)
                    } else {
                      return(NULL)
                    }
                  }))
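Note that the code above filters rows, not columns. As a hedged aside, the same row filter (keep rows with no NA at all) can probably be written without a loop using base R's `complete.cases()`. The toy data below is made up for illustration.

```r
# Toy data (illustrative only); `your.dataset` is the name used in the answer.
your.dataset <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# complete.cases() returns TRUE for rows that contain no NA.
no_na_rows <- your.dataset[complete.cases(your.dataset), , drop = FALSE]

# Equivalent condition written out explicitly, easier to adapt to other thresholds.
same_rows <- your.dataset[rowSums(is.na(your.dataset)) == 0, , drop = FALSE]
```

Swapping `rowSums` for `colSums` (and moving the index to the column position) gives the column-wise version the question asks about.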

Is it one *column* per variable, or one *row*? — sdgfsdh, Oct 05 '15 at 10:06

score 0 · Answer 2 · edited May 23 '17 at 11:44

When you use lapply on a data.frame, it applies the given function to each column, because a data.frame is a list of columns.

So if f is your function for "processing" a column, you should use:

lapply(df, f)

vapply should be used when the result will always be a vector of a known size.

sapply is like an automatic vapply. It tries to simplify the result to a vector. I would advise against using sapply, except for exploratory programming.
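To make the difference between the three functions concrete, here is a small sketch on made-up data. `count_na` and `toy` are hypothetical names used only for this illustration.

```r
# Toy data (illustrative only).
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = c(NA, NA, TRUE))

# Hypothetical per-column function: number of NA values in a column.
count_na <- function(x) sum(is.na(x))

as_list <- lapply(toy, count_na)              # always a list, one element per column
as_int  <- vapply(toy, count_na, integer(1))  # named integer vector; errors if any result is not a single integer
as_auto <- sapply(toy, count_na)              # simplifies to a vector here, but the shape depends on the results

# With zero columns the return types diverge, which is why sapply is risky in functions.
empty_v <- vapply(toy[0], count_na, integer(1))  # still an integer vector (length 0)
empty_s <- sapply(toy[0], count_na)              # an empty list, not a vector
```

The declared template (`integer(1)` here, `logical(1)` for a TRUE/FALSE test) is what lets `vapply` fail loudly instead of silently changing its return type.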

(Updated to reflect edit)

Try:

f <- function(x) {
    sum(is.na(x)) < length(x) * 0.2
}

df[, vapply(df, f, logical(1)), drop = FALSE]
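A short worked run of the snippet above on made-up data, which also shows why the `drop` argument matters. The column names and values are invented for illustration.

```r
# f as defined in the answer: TRUE when fewer than 20% of the values are NA.
f <- function(x) {
  sum(is.na(x)) < length(x) * 0.2
}

# Toy data (illustrative only): 10 rows.
df <- data.frame(
  id    = 1:10,                  # 0% NA
  score = c(NA, NA, NA, 4:10),   # 30% NA
  note  = c(NA, letters[1:9])    # 10% NA
)

keep    <- vapply(df, f, logical(1))  # one TRUE/FALSE per column
trimmed <- df[, keep, drop = FALSE]   # columns "id" and "note" survive

# If only one column survives, drop = FALSE still returns a data.frame;
# without it, the result collapses to a plain vector.
one_col <- df[, c(TRUE, FALSE, FALSE), drop = FALSE]
as_vec  <- df[, c(TRUE, FALSE, FALSE)]
```

Note that `f` uses a strict `<`, so a column with exactly 20% missing values would be dropped; use `<=` if such columns should be kept.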

answered Oct 05 '15 at 09:29 by sdgfsdh

The question has been updated now, but the questioner was asking for a way to bind `x` in `function(x)` to a column. `lapply` does exactly that. – sdgfsdh Oct 05 '15 at 09:56
