I am trying to write a comprehensive, automated script for my team that imputes missing values using several different methods. I know the imputation logic, but I am having trouble identifying the data class of each variable, which is important in deciding which imputation method to choose.
The data that I am working on looks like this:
I want my code to identify the type of each variable as one of:
- Categorical/Factor with multiple levels
- Factor with two levels, 1 and 0 (binary)
- Factor with two levels other than 1 and 0, like 'yes' and 'no'
- Continuous
Here is the WIP code that I have. It isn't doing the job well, and I understand that the logic will fail when the data is different:
data_type_vector <- function(x)
{
  categorical_index <- character()
  binary_index <- character()
  continuous_index <- character()
  binary_index_1 <- character()
  data <- x
  for (a in 1:ncol(data)) {
    if (length(unique(data[,a])) >= 2 & length(unique(data[,a])) < 15 &
        max(as.character(data[,a]), na.rm=TRUE) != 1 & min(as.character(data[,a]), na.rm=TRUE) != 0)
    {
      categorical_index <- c(categorical_index, colnames(data[a]))
    } else if (max(as.character(data[,a]), na.rm=TRUE) == 1 & min(as.character(data[,a]), na.rm=TRUE) == 0) {
      binary_index <- c(binary_index, colnames(data[a]))
    } else if (length(unique(data[,a])) == 2) {
      # this basically defines categorical variables with two categories like male/female
      # which don't have 1 0 values in the data but are still binary
      # we are keeping them separate for the purpose of further analysis
      binary_index_1 <- c(binary_index_1, colnames(data[a]))
    } else
    {
      continuous_index <- c(continuous_index, colnames(data[a]))
    }
  }
  # the four name vectors are written to the global environment
  assign("categorical_index", categorical_index, envir=globalenv())
  assign("binary_index", binary_index, envir=globalenv())
  assign("continuous_index", continuous_index, envir=globalenv())
  assign("binary_index_1", binary_index_1, envir=globalenv())
}
I am trying to improve the logic so that it is generic enough for others to use, but I have hit a wall here. I would appreciate any help.
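For reference, the four categories listed above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a tested solution: the names `classify_column` and `classify_columns`, the labels it returns, and the `max_levels` cutoff (15, borrowed from the WIP code) are all hypothetical, and treating a numeric column with fewer than `max_levels` distinct values as categorical is an assumption that may not suit every data set.

```r
# Hypothetical helper: classify a single column into one of four labels.
# max_levels is an assumed cutoff, not a general rule.
classify_column <- function(col, max_levels = 15) {
  vals <- unique(col[!is.na(col)])   # distinct non-missing values
  n <- length(vals)
  if (n == 2) {
    # Compare as character so numeric and factor 0/1 columns both match.
    if (setequal(as.character(vals), c("0", "1"))) "binary_01" else "binary_other"
  } else if (is.factor(col) || is.character(col) || is.logical(col) || n < max_levels) {
    # Note: constant or all-NA columns also end up here.
    "categorical"
  } else {
    "continuous"
  }
}

# Hypothetical wrapper: returns a named list of column names per type
# instead of assigning into the global environment.
classify_columns <- function(data, max_levels = 15) {
  types <- vapply(data, classify_column, character(1), max_levels = max_levels)
  labels <- c("categorical", "binary_01", "binary_other", "continuous")
  split(names(types), factor(types, levels = labels))
}

# Made-up example data, only to exercise the sketch.
df <- data.frame(
  colour = rep(c("red", "green", "blue", NA), 5),
  flag   = rep(c(1, 0), 10),
  answer = rep(c("yes", "no", NA, "yes"), 5),
  height = seq(150, 188, by = 2)
)
result <- classify_columns(df)
```

Returning a list keeps the result local to the caller, so `result$binary_01` or `result$continuous` can be passed straight to whichever imputation routine applies to that type.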