1

I am using the normalmixEM function (algorithim) in R over a data.table object.

It is a simple procedure to run it over the full table. This outputs a mixEM list object, in which the $posterior item is the most interest. I can map that back to the data in an admittedly slightly awkward way using cbind like so:

library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)

faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data

qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution

faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB

However, I am struggling to include the by argument of data.table usefully in this workflow. My objective is to run a normalmixEM function by the factorAB. Effectively I want to run two, separated, isolated models on subsets of the data, and then have the posteriors to hand per value in two columns at the end of it in a single data.table as per it's underlying 'split-apply-combine' strategy.

library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)

faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data

qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution

faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB

faithfulMixAB <- faithfulDT[, normalmixEM(eruptions), by = factorAB]
# model and graph with factorAB - attempt by

faithfulMixAB <- faithfulDT[, normalmixEM(.SD$eruptions), by = factorAB]
# model and graph with factorAB - attempt by and .SD

faithfulMixAB <- faithfulDT[, normalmixEM(.SD), by = factorAB, .SDcols = "eruptions"]
# model and graph with factorAB - attempt by and .SD and .SDcols

faithfulMixAB <- faithfulDT[, lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply
faithfulMixAB
# partial success?

faithfulMixABAssign <- faithfulDT[, mixMDL := lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply and try to assign
faithfulMixABAssign
# even more partial success?

Evidently here I have successfully managed to mangle out a solution which appears to have the right numbers, but in largely arbitrary locations.

What am I missing in terms of this workflow which will neaten the output in the case of including the factorAB split? Evidentially I need to find a surrogate for the cbind portion of the work, however my current output is a mess to begin with. Can I improve the output at faithfulmixAB to facilitate this? Potentially skip this entierly and assign the posterior values straight from the function running within data.table?


EDIT

After help from @eddi and a friend IRL what I'm at right now is:

faithfulDT[, mixPostFull.1 := normalmixEM(eruptions)$posterior[,1]]
faithfulDT[, mixPostFull.2 := normalmixEM(eruptions)$posterior[,2]]

which represents the two posterior columns from running the model without splitting the factor. And:

faithfulDT[, mixPostAB.1 := normalmixEM(eruptions)$posterior[,1], by = factorAB]
faithfulDT[, mixPostAB.2 := normalmixEM(eruptions)$posterior[,2], by = factorAB]

Which has both columns, but does split by factor, which is actually what I'm trying to do.

I think that something like both of these are needed as the posterior object is actually 2 vectors, one indicating probability that record is in 1 and group, and the other indicating it is in the second.

Eddi, your current answer has 2 columns, but I don't think that those correspond as laid out above. If anything, the values are slightly different:

eruptions waiting factorAB mixPostFull.1 mixPostFull.2  mixPostAB.1  mixPostAB.2
  1:     3.600      79        a  5.376906e-10  1.000000e+00 1.581467e-11 1.000000e+00
  2:     1.800      54        b  9.999998e-01  1.723648e-07 1.000000e+00 2.112761e-09
  3:     3.333      74        a  1.755506e-06  9.999982e-01 1.405098e-07 9.999999e-01
  4:     2.283      62        b  9.999406e-01  5.939085e-05 9.999974e-01 2.599843e-06
  5:     4.533      85        a  2.215050e-25  1.000000e+00 3.658846e-29 1.000000e+00
 ---                                                                                 
268:     4.117      81        b  6.337730e-18  1.000000e+00 9.658721e-10 1.000000e+00
269:     2.150      46        a  9.999912e-01  8.828998e-06 9.999828e-01 1.724380e-05
270:     4.417      90        b  3.320219e-23  1.000000e+00 1.461450e-12 1.000000e+00
271:     1.817      46        a  9.999998e-01  2.012672e-07 9.999995e-01 4.981589e-07
272:     4.467      74        b  3.912776e-24  1.000000e+00 4.818983e-13 1.000000e+00

What I need really is a method to not have to run the model twice over. I'm pretty sure I can fudge ':=' in there somewhere, but I don't have the time right now. Will return to it in a bit.

Some time later So I overlooked in the previous that I can't just re-run the model to get the second column because, apart from obviously being pretty inefficient, because of the nature of the algorithm unless I set seed, as each run has a different start point, so it will settle on a slightly different answer from one run to the next.

DaveRGP
  • 1,430
  • 15
  • 34

1 Answers1

1

I think you're looking for something like this:

faithfulDT[, {
               result = as.vector(normalmixEM(eruptions)$posterior);
               faithfulDT[, paste0('result.', factorAB) := result];
               NULL
             }
           , by = factorAB]
faithfulDT
#     eruptions waiting factorAB     result.a     result.b
#  1:     3.600      79        a 1.581719e-11 1.000000e+00
#  2:     1.800      54        b 1.405263e-07 9.999974e-01
#  3:     3.333      74        a 3.660230e-29 9.531090e-01
#  4:     2.283      62        b 5.986926e-33 3.630698e-05
#  5:     4.533      85        a 9.999983e-01 6.384911e-12
# ---                                                     
#268:     4.117      81        b 6.545978e-07 1.000000e+00
#269:     2.150      46        a 2.342451e-06 1.562445e-06
#270:     4.417      90        b 1.000000e+00 1.000000e+00
#271:     1.817      46        a 1.724380e-05 1.000000e+00
#272:     4.467      74        b 4.981589e-07 1.000000e+00

After the discussions in the comments and in the OP, turns out the desired answer is:

faithfulDT[, c('mixAB.1', 'mixAB.2') := as.data.table(normalmixEM(eruptions)$posterior)
           , by = factorAB]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • This really does appear to be what I need. Could you explain a little about what's going on here? I get (now) that you are referring to the posterior object of the model from the top line, which is then feeding the second line which formats the headers on the output? What is the NULL in the third line for? Is the whole J argument now a self-contained anonymous function? – DaveRGP Jan 13 '16 at 09:14
  • Additionally, posterior in this model is actually a *pair* of values, and I'm wondering if that's what I'm seeing here from the labels you've chosen. One labelled comp.1 and one labelled comp.2. In this case I would like a column representing the comp.1 for the a split if the factor is a, and a corresponding comp.2 for the a value. Then if the factor is b have a corresponding comp.1 for b and comp.2 for b value. – DaveRGP Jan 13 '16 at 09:33
  • 1
    @DaveRGP Yes, the whole `j-expression` is an anonymous function, and the `NULL` is the return value of that function - we don't care here what we return as the main operation we care about is the assignment in the second line, and `NULL` is a good choice (but really it largely doesn't matter what you put on the 3rd line). I can't really comment on the details in your second comment, as I didn't actually look into what the `normalmixEM` function does. I simply rewrote your expression in the OP to run your operations once for each factor and assign the result of each to a new column. – eddi Jan 13 '16 at 15:48
  • 1
    I notice you put some more info in the OP - I'll take a look in a bit and will reply back. – eddi Jan 13 '16 at 15:51
  • I realise in my continued working that maybe your function provides the information and is displaying both parameters accurately from the same run. However without being able to understand all of your solution, the opposite might also be true. Thanks for your continued help on this. – DaveRGP Jan 13 '16 at 16:15
  • 1
    @DaveRGP are you simply looking for: `faithfulDT[, c('mixAB.1', 'mixAB.2') := as.data.table(normalmixEM(eruptions)$posterior), by = factorAB]` ? It's hard for me to tell what your ultimate desired result is from the OP. – eddi Jan 13 '16 at 18:43
  • thank you very much. That is what I was looking for. If you stick it in an edit for your original answer I'll accept. Thanks again. – DaveRGP Jan 14 '16 at 10:08
  • 1
    sure thing, glad to help – eddi Jan 14 '16 at 15:35