
I have a dataset that contains binary and categorical columns coded as discrete numeric values, alongside continuous features. I am trying to build a function that finds the indexes of the columns that do not contain continuous numeric values.

An example dataset is given below:

data <- data.frame(var1=c(rep(1,5),rep(0,5)),var2=c(rep(0,2),rep(1,8)),
  var3=c(1,2,3,4,4,2,3,1,1,2), var4=rnorm(10),
  var5=as.numeric(c(rnorm(6),rep("NA",4))))

  var1 var2 var3       var4       var5
1     1    0    1  0.7312777 -1.3902633
2     1    0    2  0.5120417 -1.2470914
3     1    1    3  1.6502341 -0.9980822
4     1    1    4  0.4298987  0.7766762
5     1    1    4 -0.8025510 -0.5221676
6     0    1    2  0.2001818 -1.2300872
7     0    1    3 -0.5521180         NA
8     0    1    1 -1.7895327         NA
9     0    1    1 -0.5309557         NA
10    0    1    2 -1.7362210         NA

I have tried the following function so far:

is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L && all(x[1:2] == 0:1)
}

This function detects columns that contain only the two values 1 and 0, even if they also contain NA, but it does not identify the binary or categorical columns correctly. When I ran the function using the command:

vapply(data, is.binary, logical(1))

the result was

var1  var2  var3  var4  var5 
FALSE  TRUE FALSE FALSE FALSE 

Instead, I would like it to identify the first 3 columns as binary/categorical.

David Arenburg
syebill
  • First, a tip: just use `var5=c(rnorm(6),rep(NA,4)))` – MichaelChirico Aug 03 '15 at 15:16
  • I did and it worked on var5 correctly as it identified var5 as not a binary column. – syebill Aug 03 '15 at 15:27
  • If I use `is.binary <- function(v) { x <- unique(v); length(x) - sum(is.na(x)) == 2L }`, then it detects the binary columns correctly. How can it be modified to detect the third column as binary as well? Also, if there are missing values in a binary column, this function does not detect it as binary. – syebill Aug 03 '15 at 15:35
  • That worked on the example dataframe. Thanks. For understanding, is it that floor only works with whole values? I guess this needs to be manually done for a dataset because if I introduce a column which is numeric but contains integer values without decimals then this column will be detected as binary as per the function. If all the numeric attributes are normalized then this function works perfectly. Thanks for your help and any suggestions welcome. – syebill Aug 03 '15 at 15:49

2 Answers


You can check whether the difference between the numbers and floor(numbers) (or trunc/ceiling) is numerically insignificant using all.equal:

sapply(data, function(x) isTRUE(all.equal(x, floor(x))))
#  var1  var2  var3  var4  var5 
#  TRUE  TRUE  TRUE FALSE FALSE 

For binary columns, you could further check that length(unique(trunc(numbers))) == 2L.
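
Putting the two checks together, here is a minimal sketch. The helper names `is_whole` and `is_binary` are hypothetical (they are not part of the answer above), and dropping NA before comparing is an assumption about how missing values should be treated:

```r
# Hypothetical helpers; NA values are dropped before any comparison (assumption).
is_whole <- function(x) {
  x <- x[!is.na(x)]
  # TRUE when every non-missing value is numerically equal to its floor
  length(x) > 0 && isTRUE(all.equal(x, floor(x)))
}

is_binary <- function(x) {
  # whole-number column with exactly two distinct non-missing values
  is_whole(x) && length(unique(x[!is.na(x)])) == 2L
}

set.seed(1)
data <- data.frame(
  var1 = c(rep(1, 5), rep(0, 5)),
  var2 = c(rep(0, 2), rep(1, 8)),
  var3 = c(1, 2, 3, 4, 4, 2, 3, 1, 1, 2),
  var4 = rnorm(10),
  var5 = c(rnorm(6), rep(NA, 4))
)

whole  <- vapply(data, is_whole, logical(1))
binary <- vapply(data, is_binary, logical(1))

which(whole)   # indexes of columns holding only whole numbers
which(binary)  # indexes of columns holding exactly two whole-number values
```

Note that a genuinely numeric column that happens to hold only whole numbers (a count, say) would also be flagged by `is_whole`, so the result may still need a manual review.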

Rorschach

Using data.table, which has the convenient uniqueN function:

library(data.table)
setDT(data) # convert to data.table by reference, no copies
# count the distinct non-NA values in each column
data[, sapply(.SD, function(x) uniqueN(na.omit(x))) <= 2L]
# var1 var2  var3  var4  var5
# TRUE TRUE FALSE FALSE FALSE

In base R you can use:

sapply(data, function(x) length(unique(na.omit(x))) <= 2L)
# var1  var2  var3  var4  var5 
# TRUE  TRUE FALSE FALSE FALSE 
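
If columns with more than two levels should also be flagged, the same idea can be loosened with a cutoff on the number of distinct values. This is only a sketch: `few_levels` and its `max_levels` argument are hypothetical names, the cutoff of 4 is an arbitrary assumption that you would have to choose for your own data, and the fixed values in `var4` and `var5` are stand-ins for the random columns in the question:

```r
# Hypothetical helper: TRUE when a column has at most `max_levels`
# distinct non-missing values. `max_levels` is an assumed tuning knob.
few_levels <- function(x, max_levels = 2L) {
  n <- length(unique(na.omit(x)))
  n > 0L && n <= max_levels
}

df <- data.frame(
  var1 = c(rep(1, 5), rep(0, 5)),
  var2 = c(rep(0, 2), rep(1, 8)),
  var3 = c(1, 2, 3, 4, 4, 2, 3, 1, 1, 2),
  var4 = seq(0.05, 0.95, by = 0.1),  # stand-in for a continuous column
  var5 = c(0.11, 0.27, 0.35, 0.48, 0.52, 0.69, NA, NA, NA, NA)
)

# column indexes with at most 2 distinct values (binary)
binary_cols <- which(vapply(df, few_levels, logical(1)))

# column indexes with at most 4 distinct values (binary or low-cardinality)
categorical_cols <- which(vapply(df, few_levels, logical(1), max_levels = 4L))
```

A count-based cutoff like this cannot tell a short continuous column from a categorical one, so it works best on datasets with many more rows than levels.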
MichaelChirico
  • Can this function detect binary columns even if they have "NA's"? Where can I find this "uniqueN" function? I have checked the vignette of data.table but could not see it there. How can it be modified to detect the third column as "TRUE" as that column is also categorical but coded as numeric? Thanks. – syebill Aug 03 '15 at 15:43