How to find missing values?

Question

What techniques (such as KNN or maximum likelihood) can I use to estimate missing values? I want to use R and am trying to find a suitable technique to impute the missing values.

The sample data is shown below:

F1    F2       F3       F4       F5   Class
Good  20       5        7        Old  Normal
Good  Missing  8        8        Old  Normal
Good  15       10       10       Old  Normal
Good  50       10       10       Old  Normal
Good  70       10       10       Old  Abnormal
Bad   20       5        7        Old  Abnormal
Good  20       5        80       Old  Abnormal
Good  85       100      100      Old  Abnormal
Good  20       100      Missing  Old  Abnormal
Good  24       6        8.4      Old  Normal
Good  12       9.6      9.6      Old  Normal
Good  18       12       12       Old  Normal
Good  60       12       12       Old  Normal
Good  84       Missing  12       Old  Abnormal
Bad   24       6        8.4      Old  Abnormal
Good  24       6        96       Old  Abnormal
Good  102      120      120      Old  Abnormal
Good  24       120      72       Old  Abnormal

See packages mice and Amelia and their vignettes and references. — Roland, Apr 04 '17 at 06:59
In the function `read.table()` you can set `na.strings = "Missing"` — jogo, Apr 04 '17 at 07:00
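
A minimal sketch of that suggestion, using a few rows of the sample data pasted inline. The `text =` argument and the name `df` are only for illustration; with a real file you would pass its path instead.

```r
# A few rows of the sample data, inlined for illustration only
txt <- "F1 F2 F3 F4 F5 Class
Good 20 5 7 Old Normal
Good Missing 8 8 Old Normal
Good 15 10 10 Old Normal
Good 20 100 Missing Old Abnormal
Good 84 Missing 12 Old Abnormal"

# Treat the literal string "Missing" as NA while reading
df <- read.table(text = txt, header = TRUE, na.strings = "Missing",
                 stringsAsFactors = TRUE)

str(df)             # F2, F3 and F4 are read as numbers, not as text
colSums(is.na(df))  # number of NAs per column
```

Without `na.strings`, the columns containing "Missing" would be read as text, and most imputation functions would not treat those cells as missing.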

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Here are a few code snippets that can help with the analysis.

Checking whether the data contains any NA

any(is.na(mydata))  # mydata stands for your data frame

Visualizing missing data

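library(VIM)
# Plot the proportion of missing values per column and the missingness patterns
aggr(mydata, plot = TRUE, bars = TRUE)  # mydata stands for your data frame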

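Calculating the proportion of NAs per column

Create a simple function:

propmiss <- function(dataframe) {
  # For each column: number of NAs, number of values, proportion of NAs
  lapply(dataframe, function(x) data.frame(nmiss = sum(is.na(x)), n = length(x), propmiss = mean(is.na(x))))
}

propmiss(mydata)  # mydata stands for your data frame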
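
If a single summary table is more convenient than a list with one element per column, the same numbers can be computed with `colSums()` and `colMeans()`. This is only a sketch on a made-up toy data frame; `df` and `miss` are illustrative names.

```r
# Toy data frame, made up for illustration
df <- data.frame(F2 = c(20, NA, 15, 50),
                 F3 = c(5, 8, NA, NA),
                 F4 = c(7, 8, 10, 10))

# One row per column: count of NAs, number of rows, proportion of NAs
miss <- data.frame(nmiss    = colSums(is.na(df)),
                   n        = nrow(df),
                   propmiss = colMeans(is.na(df)))
miss
```

Multiply `propmiss` by 100 if you prefer percentages.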

Removing rows with more than 50% missing values (a similar approach works for columns)

# Flag rows in which more than half of the columns are NA
sparse.rows <- which(rowMeans(is.na(clust.datatrain)) > 0.5)
length(sparse.rows)  # 25 for my data
# Guard: negative indexing with an empty vector would drop every row
if (length(sparse.rows) > 0) {
  clust.datatrain <- clust.datatrain[-sparse.rows, ]
}
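
The column version mentioned above could look like the following sketch. The toy data frame and the name `sparse.cols` are made up for illustration, and the 50% threshold simply mirrors the row case.

```r
# Toy data, made up for illustration: column b is 75% NA
df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(NA, NA, NA, 1),
                 c = c(1, 2, 3, 4))

# Flag columns in which more than half of the values are NA
sparse.cols <- which(colMeans(is.na(df)) > 0.5)

# Only drop columns when something was flagged; negative indexing
# with an empty vector would otherwise remove every column
if (length(sparse.cols) > 0) {
  df <- df[, -sparse.cols, drop = FALSE]
}
names(df)
```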

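Imputation

KNN

library(DMwR)  # if DMwR is not installable for your R version, DMwR2 also provides knnImputation()
# Fill each NA with a distance-weighted average of its 10 nearest neighbours
train.1 <- knnImputation(clust.datatrain, k = 10, scale = TRUE, meth = "weighAvg",
                         distData = NULL)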

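MICE (supports multiple imputation methods); the example below uses Bayesian linear regression

library(mice)
# m = 5 imputed data sets, up to 50 iterations; "norm" is Bayesian linear regression (for numeric columns)
xdash <- mice(datafile, m = 5, maxit = 50, method = "norm", seed = 500)
# Extract the first of the five completed data sets
completedata <- complete(xdash, 1)
completedata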

This should cover both the missing-data analysis and the imputation.