3

I fit some data to a mixture distribution of two gaussian in flexmix:

data("NPreg", package = "flexmix")
mod <- flexmix(yn ~ x, data = NPreg, k = 2,
           model = list(FLXMRglm(yn ~ x, family= "gaussian"),
                        FLXMRglm(yn ~ x, family = "gaussian")))

the model fit is as follows:

> mod

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, model =    list(FLXMRglm(yn ~ x, family = "gaussian"), 
    FLXMRglm(yn ~ x, family = "gaussian")))

Cluster sizes:
  1   2 
 74 126 

convergence after 31 iterations

But how do I actually predict from this model?

when I do

pred <- predict(mod, NPreg)

I get a list with the predictions from each of the two components

To get a single prediction, do I have to add in the cluster sizes like this?

single <- (74/200)* pred$Comp.1[,1] + (126/200)*pred$Comp.2[,2]
spore234
  • 3,550
  • 6
  • 50
  • 76

2 Answers2

8

I use flexmix for prediction in the following way:

pred = predict(mod, NPreg)
clust = clusters(mod,NPreg)
result = cbind(NPreg,data.frame(pred),data.frame(clust))
plot(result$yn,col = c("red","blue")[result$clust],pch = 16,ylab = "yn")

Clusters in NPreg

And the confusion matrix:

table(result$class,result$clust)

Confusion Matrix for NPreg

For getting the predicted values of yn, I select the component value of the cluster to which a data point belongs.

for(i in 1:nrow(result)){
  result$pred_model1[i] = result[,paste0("Comp.",result$clust[i],".1")][i]
  result$pred_model2[i] = result[,paste0("Comp.",result$clust[i],".2")][i]
}

The actual vs predicted results show the fit (adding only one of them here as both of your models are same, you would use pred_model2 for the second model).

qplot(result$yn, result$pred_model1,xlab="Actual",ylab="Predicted") + geom_abline()

Actual Vs Predicted

RMSE = sqrt(mean((result$yn-result$pred_model1)^2))

gives a root mean square error of 5.54.

This answer is based on many SO answers I read through while working with flexmix. It worked well for my problem.

You may also be interested in visualizing the two distributions. My model was the following, which shows some overlap as the ratio of components are not close to 1.

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, 
model = list(FLXMRglm(yn ~ x, family = "gaussian"), 
             FLXMRglm(yn ~ x, family = "gaussian")))

       prior size post>0 ratio
Comp.1 0.481  102    129 0.791
Comp.2 0.519   98    171 0.573

'log Lik.' -1312.127 (df=13)
AIC: 2650.255   BIC: 2693.133 

I also generate a density distribution with histograms to visulaize both components. This was inspired by a SO answer from the maintainer of betareg.

a = subset(result, clust == 1)
b = subset(result, clust == 2)
hist(a$yn, col = hcl(0, 50, 80), main = "",xlab = "", freq = FALSE, ylim = c(0,0.06))
hist(b$yn, col = hcl(240, 50, 80), add = TRUE,main = "", xlab = "", freq = FALSE, ylim = c(0,0.06))
ys = seq(0, 50, by = 0.1)
lines(ys, dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)), col = hcl(0, 80, 50), lwd = 2)
lines(ys, dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)), col = hcl(240, 80, 50), lwd = 2)

Density of Components

# Joint Histogram
p <- prior(mod)
hist(result$yn, freq = FALSE,main = "", xlab = "",ylim = c(0,0.06))
lines(ys, p[1] * dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)) +
        p[2] * dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)))

enter image description here

Community
  • 1
  • 1
Divi
  • 1,614
  • 13
  • 23
0

You can pass an additional argument to your prediction call.

pred <- predict(mod, NPreg, aggregate = TRUE)[[1]][,1]
vivek mishra
  • 196
  • 2
  • 6