I have a dataset that contains binary and categorical columns coded as discrete numeric values, alongside continuous features. I am trying to build a function that finds the indexes of the columns that hold binary or categorical values rather than continuous ones.
An example dataset is given below:
data <- data.frame(var1=c(rep(1,5),rep(0,5)),var2=c(rep(0,2),rep(1,8)),
var3=c(1,2,3,4,4,2,3,1,1,2), var4=rnorm(10),
var5=c(rnorm(6),rep(NA,4)))
   var1 var2 var3       var4       var5
1     1    0    1  0.7312777 -1.3902633
2     1    0    2  0.5120417 -1.2470914
3     1    1    3  1.6502341 -0.9980822
4     1    1    4  0.4298987  0.7766762
5     1    1    4 -0.8025510 -0.5221676
6     0    1    2  0.2001818 -1.2300872
7     0    1    3 -0.5521180         NA
8     0    1    1 -1.7895327         NA
9     0    1    1 -0.5309557         NA
10    0    1    2 -1.7362210         NA
I have tried the following function so far:
is.binary <- function(v) {
x <- unique(v)
length(x) - sum(is.na(x)) == 2L && all(x[1:2] == 0:1)
}
This function is meant to detect columns that contain only the two values 0 and 1, even when they also contain NA, but it does not identify the binary or categorical columns correctly. When I ran the function with the command:
vapply(data, is.binary, logical(1))
the result was
 var1  var2  var3  var4  var5
FALSE  TRUE FALSE FALSE FALSE
However, I would like it to identify the first three columns (var1, var2 and var3) as binary/categorical.
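For context on the result above: unique() returns values in order of first appearance, so var1 (which starts with 1) gives c(1, 0), and the element-wise comparison against 0:1 fails even though the column is binary. A minimal sketch of one possible order-independent check follows. The helper name is.discrete and the cutoff max_levels = 5 are assumptions made for illustration, not part of the original attempt; it treats a column as binary/categorical when it holds only whole numbers and has few distinct non-missing values.

```r
# Hypothetical helper: TRUE when a numeric column looks binary/categorical.
# max_levels is an assumed cutoff; tune it to the data at hand.
is.discrete <- function(v, max_levels = 5L) {
  x <- unique(v[!is.na(v)])            # distinct non-missing values
  is.numeric(v) &&
    length(x) > 0L &&
    length(x) <= max_levels &&         # few distinct values
    all(x == round(x))                 # whole numbers only
}

# Example data with the same layout as in the question
set.seed(1)
data <- data.frame(var1 = c(rep(1, 5), rep(0, 5)),
                   var2 = c(rep(0, 2), rep(1, 8)),
                   var3 = c(1, 2, 3, 4, 4, 2, 3, 1, 1, 2),
                   var4 = rnorm(10),
                   var5 = c(rnorm(6), rep(NA, 4)))

flags <- vapply(data, is.discrete, logical(1))
which(flags)   # column indexes flagged as binary/categorical
```

On this example the flagged indexes are 1, 2 and 3. The heuristic would misclassify a continuous column that happens to hold only a few whole numbers, so the cutoff is a judgment call.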