I'm new to writing functions, and I'm sure this is a simple one. I have a data frame with 111 columns and ~10,500 rows, with all missing values coded as <NA>. Intuitively, I need a function that does the following column-wise over a data frame:
ifelse(sum(is.na(colx)) > length(colx)/5, NULL, colx)
That is, I need to drop any variables with more than 1/5 (20%) missing values. Thanks to all for pointing out that there's a similar answer, namely using
colMeans(is.na(mydf)) > .20
to ID the columns, but this doesn't fully answer my question.
The above code returns a logical vector indicating the variables to be dropped. I have more than 100 variables with complex names, and picking through them to drop by hand is tedious and bound to introduce errors. How can I modify the above, or use some version of my originally proposed ifelse, to return a new data frame containing only the columns with at most 20% NA, as I asked originally?
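To make the goal concrete, here is a minimal sketch of the result I'm after, on a made-up toy data frame. The names `toy_df`, `too_many_na`, and `trimmed_df` are hypothetical and used only for illustration. It assumes the logical vector from `colMeans(is.na(...))` can be negated and used directly to index columns, which seems to work on this small example:

```r
# Toy data frame (made up for illustration): 10 rows, 3 columns
toy_df <- data.frame(
  a = 1:10,                                       # 0% missing
  b = c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10),        # 30% missing
  c = c("x", NA, "z", "w", "v", "u", "t", "s", "r", "q")  # 10% missing
)

# TRUE for columns where more than 20% of the values are NA
too_many_na <- colMeans(is.na(toy_df)) > 0.20

# Keep only the columns that are NOT flagged; drop = FALSE ensures the
# result stays a data frame even if a single column survives
trimmed_df <- toy_df[, !too_many_na, drop = FALSE]

names(trimmed_df)  # columns a and c remain; b is dropped
```

The point is that no column names need to be typed by hand: the logical vector does the selection.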
Thanks!!