
I am trying to write comprehensive, automated code for my team that imputes missing values using several different methods. I know the logic, but I am having trouble identifying the data class of each variable, which is important in deciding which imputation method to choose.

The data that I am working on looks like this: (image not included)

Now, I want my code to identify the type of variables as:

  1. Categorical/Factor with multiple levels
  2. Factor with two levels, 1 and 0 (binary)
  3. Factor with two levels except 1 and 0, like 'yes' and 'no'
  4. Continuous

Here is the WIP code that I have, but it isn't doing the job well, and I understand that the logic will fail when the data is different:

data_type_vector <- function(x)
{
  categorical_index <- character()
  binary_index <- character()
  continuous_index <- character()
  binary_index_1 <- character()

  data <- x

  for (a in 1:ncol(data)) {

    if (length(unique(data[,a])) >= 2 & length(unique(data[,a])) < 15 &
        max(as.character(data[,a]), na.rm=T) != 1 & min(as.character(data[,a]), na.rm=T) != 0)
    {

      categorical_index <- c(categorical_index, colnames(data[a]))

    } else if (max(as.character(data[,a]), na.rm=T) == 1 & min(as.character(data[,a]), na.rm=T) == 0) {

      binary_index <- c(binary_index, colnames(data[a]))

    } else if (length(unique(data[,a])) == 2) {

      # this basically defines categorical variables with two categories like male/female
      # which don't have 1 0 values in the data but are still binary
      # we are keeping them separate for the purpose of further analysis

      binary_index_1 <- c(binary_index_1, colnames(data[a]))

    } else
    {
      continuous_index <- c(continuous_index, colnames(data[a]))
    }

  }

  # export the four name vectors to the global environment
  assign("categorical_index", categorical_index, envir=globalenv())
  assign("binary_index", binary_index, envir=globalenv())
  assign("continuous_index", continuous_index, envir=globalenv())
  assign("binary_index_1", binary_index_1, envir=globalenv())
}

I am trying to improve the logic so that it is generic enough for others to use, but I have hit a wall here. I would appreciate any help.

Hack-R
Ranjan Pandey

1 Answer

This can be done by checking the number of levels and the levels themselves. `categorize` is a generic that dispatches to `categorize.data.frame` when given a data frame, which in turn calls `categorize.default` on each column. `categorize` can also be called directly on a column.

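It works by computing the number of levels, capped at 3, and adding 2 if the levels are exactly "0" and "1". This gives a number between 0 and 4 inclusive: 0 for a non-factor, 1 for a one-level factor, 2 for a two-level factor, 3 for a factor with three or more levels, and 4 for a two-level factor whose levels are "0" and "1". That number is then turned into a factor with meaningful level names.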
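To make the arithmetic concrete, here is a small sketch that exposes only the intermediate score. The helper name `score` is hypothetical and used purely for illustration; it is just the expression inside `categorize.default` pulled out on its own.

```r
# Hypothetical helper: the raw 0..4 score, before it is wrapped in a factor.
score <- function(x) min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1"))

score(1:3)                            # not a factor, nlevels() is 0: 0
score(factor(c("a", "a")))            # one level: 1
score(factor(c("yes", "no")))         # two levels, not "0"/"1": 2
score(factor(c("a", "b", "c", "d")))  # four levels, capped at 3: 3
score(factor(c(0, 1, 0)))             # two levels "0" and "1": 2 + 2 = 4
```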

Note that anything that is not a factor will be identified as "continuous". For example, as implied by the question, a numeric column containing just 0s and 1s is classified as continuous because it is not a factor.
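If numeric 0/1 columns or character columns should be treated as factors, one option is to convert them before classifying. The following is only a sketch under that assumption: `as_factor_if_discrete` and its `max_levels` threshold are hypothetical names that are not part of this answer, and the right threshold depends on your data.

```r
categorize <- function(x, ...) UseMethod("categorize")
categorize.data.frame <- function(x, ...) sapply(x, categorize)
categorize.default <- function(x, ...) {
  factor(min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1")),
         levels = 0:4,
         labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

# Hypothetical pre-processing step (an assumption, not from the answer):
# convert non-factor columns with at most `max_levels` distinct
# non-missing values into factors, and leave everything else alone.
as_factor_if_discrete <- function(df, max_levels = 2) {
  df[] <- lapply(df, function(col) {
    if (!is.factor(col) && length(unique(na.omit(col))) <= max_levels) {
      factor(col)
    } else {
      col
    }
  })
  df
}

raw <- data.frame(flag = c(0, 1, NA, 1),
                  sex = c("m", "f", "m", "f"),
                  income = c(10.5, 20.1, 30.7, 15.2),
                  stringsAsFactors = FALSE)

# flag becomes a "0"/"1" factor, sex a two-level factor, income stays numeric
result <- categorize(as_factor_if_discrete(raw))
result
```

With `max_levels = 2` only binary-looking columns are converted; raising it would also turn low-cardinality numeric columns into factors, which may or may not be what you want.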

categorize <- function(x, ...) UseMethod("categorize")

categorize.data.frame <- function(x, ...) sapply(x, categorize)

categorize.default <- function(x, ...) {
   factor(min(nlevels(x), 3) + 2*identical(levels(x), c("0", "1")), levels = 0:4, 
    labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

Now test it out:

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")), 
         c = factor(1:3), d = 1:3)

categorize(DF)
##          a          b          c          d 
##   zero-one    factor2     factor continuous 
## Levels: continuous factor1 factor2 factor zero-one

categorize(DF$a)
## [1] zero-one
## Levels: continuous factor1 factor2 factor zero-one

categorize(0:1)
## [1] continuous
## Levels: continuous factor1 factor2 factor zero-one
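Since the question wanted vectors of column names per type, the result can be used to group the names with `split`. This is a minimal sketch reusing the definitions and test data from above; the variable name `groups` is just for illustration.

```r
categorize <- function(x, ...) UseMethod("categorize")
categorize.data.frame <- function(x, ...) sapply(x, categorize)
categorize.default <- function(x, ...) {
  factor(min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1")),
         levels = 0:4,
         labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)

# One list element per category; categories with no columns are kept as
# empty character vectors because split() does not drop unused levels
# by default.
groups <- split(names(DF), categorize(DF))

groups[["zero-one"]]    # "a"
groups[["factor2"]]     # "b"
groups[["factor"]]      # "c"
groups[["continuous"]]  # "d"
```

This returns everything in one list rather than assigning separate objects into the global environment.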

Note: Since what is being asked for is close to just asking for the number of levels, an alternative might be to just return the number of levels and use -2 to mean a binary factor with "0", "1" levels. That is,

categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))
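As a quick check of this numeric variant, here it is applied to the same test data frame as before. The name `codes` is only illustrative.

```r
categorize <- function(x, ...) UseMethod("categorize")
categorize.data.frame <- function(x, ...) sapply(x, categorize)
categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)

codes <- categorize(DF)
codes
# a: 2 - 4 = -2 (binary "0"/"1"), b: 2, c: 3, d: 0 (not a factor)

# Selecting column names by code
names(codes)[codes == -2]  # "a"
names(codes)[codes == 0]   # "d"
```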
G. Grothendieck
  • That is a really good explanation and it makes total sense. So, if I use `categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))`, these are the values and meanings: -2: binary (0 and 1); 0: continuous; 1: factor1; 2: factor2; >=3: factor with multiple levels. Am I correct? – Ranjan Pandey Sep 26 '16 at 06:17
  • If n is the value of the formula in your comment, then n = 2 means 2 levels, n = 3 means 3 levels, n = 4 means 4 levels, etc. Use `min(nlevels(x), 3) - 4 * identical(levels(x), c("0", "1"))` if you want 3 to mean 3 or more levels, but I am not sure there is really any advantage in cutting it off at 3 like that. – G. Grothendieck Sep 26 '16 at 06:49
  • That makes sense, thank you so much for the solution. – Ranjan Pandey Sep 26 '16 at 07:01
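For reference, a short sketch of the capped formula from the comments, in which 3 stands for "three or more levels". The function name `capped_code` is hypothetical and only used here to show the formula on its own.

```r
# Hypothetical name for the capped formula discussed in the comments.
capped_code <- function(x) min(nlevels(x), 3) - 4 * identical(levels(x), c("0", "1"))

capped_code(factor(letters[1:5]))    # five levels, capped at 3: 3
capped_code(factor(c("yes", "no")))  # two levels, not "0"/"1": 2
capped_code(factor(c(0, 1)))         # two levels "0" and "1": 2 - 4 = -2
capped_code(c(1.5, 2.5))             # not a factor: 0
```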