I'm trying to impute missing data with a Gaussian mixture model. My reference paper is here: http://mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf

I am currently working with a bivariate dataset and two Gaussian components. This is the code that defines the weight of each Gaussian component:

myData <- faithful[, 1:2]   # the data matrix
for (i in 1:N) {
  # weighted densities of the complete cases under each component
  prob1 <- pi1 * dmvnorm(na.exclude(myData[, 1:2]), m1, Sigma1)
  prob2 <- pi2 * dmvnorm(na.exclude(myData[, 1:2]), m2, Sigma2)
  # Z is the latent variable that assigns each data point to a component
  Z <- rbinom(no, 1, prob1 / (prob1 + prob2))

  # draw the mixing proportion; swap the labels so that pi1 <= 1/2
  pi1 <- rbeta(1, sum(Z) + 1/2, no - sum(Z) + 1/2)
  if (pi1 > 1/2) {
    pi1 <- 1 - pi1
    Z <- 1 - Z
  }
}

This is how I identify the rows with missing values:

> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
   eruptions waiting
11     1.833      NA
12     3.917      NA
13     4.200      NA
14     1.750      NA
15     4.700      NA
16     2.167      NA
17     1.750      NA
18     4.800      NA
19     1.600      NA
20     4.250      NA

My problem is how to impute the missing values of the "waiting" variable according to the component each point belongs to. This code is my first attempt, using conditional mean imputation. I know it is wrong: the imputed values do not fall within the appropriate component, and it produces outliers.

miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
  myData[i, "waiting"] <- m1[2] +
    rho * sqrt(Sigma1[2, 2] / Sigma1[1, 1]) * (myData[i, "eruptions"] - m1[1]) +
    rnorm(1, 0, Sigma1[2, 2])
}
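For comparison, here is a minimal sketch of a component-aware version of that imputation step. It uses the observed "eruptions" value to weight the two components (via the marginal density of "eruptions" under each one), draws a component label, and then draws "waiting" from that component's conditional normal, with mean m[2] + Sigma[2,1]/Sigma[1,1] * (x - m[1]) and variance Sigma[2,2] - Sigma[2,1]^2/Sigma[1,1]. The parameter values are made-up placeholders and `impute_one` is a hypothetical helper; in a sampler they would be replaced by the current parameter draws.

```r
# ASSUMPTION: illustrative parameter values only, not estimates from the question.
pi1 <- 0.36; pi2 <- 1 - pi1
m1 <- c(2.0, 54); Sigma1 <- matrix(c(0.07, 0.44, 0.44, 34), 2, 2)
m2 <- c(4.3, 80); Sigma2 <- matrix(c(0.17, 0.94, 0.94, 36), 2, 2)

myData <- faithful[, 1:2]
myData[11:20, "waiting"] <- NA   # same rows as in the question

# Hypothetical helper: draw one value of waiting given an observed eruptions value x
impute_one <- function(x) {
  # component weights use only the observed coordinate (marginal of eruptions)
  w1 <- pi1 * dnorm(x, m1[1], sqrt(Sigma1[1, 1]))
  w2 <- pi2 * dnorm(x, m2[1], sqrt(Sigma2[1, 1]))
  z <- rbinom(1, 1, w1 / (w1 + w2))   # 1 = component 1, 0 = component 2
  m <- if (z == 1) m1 else m2
  S <- if (z == 1) Sigma1 else Sigma2
  # conditional normal of waiting given eruptions within the drawn component
  cond_mean <- m[2] + S[2, 1] / S[1, 1] * (x - m[1])
  cond_var  <- S[2, 2] - S[2, 1]^2 / S[1, 1]
  rnorm(1, cond_mean, sqrt(cond_var))  # rnorm expects a standard deviation
}

set.seed(1)
miss <- which(is.na(myData$waiting))
myData$waiting[miss] <- vapply(myData$eruptions[miss], impute_one, numeric(1))
```

Because the component is drawn per row, a short eruption is almost always imputed from the short-waiting component and a long eruption from the other one, which is what keeps the imputed points inside their cluster.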

I would appreciate any advice on how to improve the imputation so that it works with the latent (hidden) variable of the Gaussian mixture model. Thank you in advance.

Jas
  • This depends entirely on the covariance structure you assume for your mixture model. But the general process is to have two EM steps per iteration – alexwhitworth Dec 03 '16 at 17:43

1 Answer

This is a solution for one type of covariance structure.

devtools::install_github("alexwhitworth/emclustr")
library(emclustr)
data(faithful)
set.seed(23414L)
ff <- apply(faithful, 2, function(j) {
  na_idx <- sample.int(length(j), 50, replace = FALSE)
  j[na_idx] <- NA
  return(j)
})
ff2 <- em_clust_mvn_miss(ff, nclust=2)

# hmm... seems I don't return the imputed values. 
# note to self to update the code    
plot(faithful, col= ff2$mix_est)

And the parameter estimates:

$it
[1] 27

$clust_prop
[1] 0.3955708 0.6044292

$clust_params
$clust_params[[1]]
$clust_params[[1]]$mu
[1]  2.146797 54.833431

$clust_params[[1]]$sigma
[1] 13.41944


$clust_params[[2]]
$clust_params[[2]]$mu
[1]  4.317408 80.398192

$clust_params[[2]]$sigma
[1] 13.71741
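
Since the imputed values are not returned, one rough way to get them from the printed estimates is sketched below. It assumes (this is a guess, not something the output confirms) that each `sigma` is a single isotropic variance shared by both coordinates of a cluster. Under that assumption the coordinates are uncorrelated within a cluster, so the expected value of a missing "waiting" is the responsibility-weighted average of the cluster means. `expected_waiting` is a hypothetical helper, not part of emclustr.

```r
# Values copied from the printed output above
clust_prop <- c(0.3955708, 0.6044292)
mu <- rbind(c(2.146797, 54.833431),
            c(4.317408, 80.398192))
sigma <- c(13.41944, 13.71741)   # ASSUMPTION: per-cluster isotropic variance

# Hypothetical helper: expected waiting time given an observed eruptions value x
expected_waiting <- function(x) {
  # cluster weights from the observed coordinate only
  w <- clust_prop * dnorm(x, mean = mu[, 1], sd = sqrt(sigma))
  r <- w / sum(w)                # responsibilities, sum to 1
  sum(r * mu[, 2])               # weighted average of the cluster means
}

expected_waiting(1.833)
```

If `sigma` turns out to mean something else (for example a standard deviation), only the `dnorm` call needs to change.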
alexwhitworth