
I am weighing the efficacy of using one monolithic model versus splitting out into two different models (a split model) on about 100,000 rows of data. To do so, I am getting results from my split model like so:

# one predict() call per row, choosing the model by the value of col
preds <- numeric(nrow(DF))
for (i in 1:nrow(DF))
{
  if (DF[i,]$col == condition)
  {
    preds[i] <- predict(glm1, DF[i,])
  }
  else
  {
    preds[i] <- predict(glm2, DF[i,])
  }
}

For whatever reason, this seems to be going extremely slowly, especially when compared to just getting predictions for an entire data frame like so:

preds <- predict(glm1,DF)

Do you have any ideas on how I can optimize the first snippet?

user1775655
  • I'm not at all surprised it's slow. Seems as though you could get this with two 'predict' calls by using an appropriate pair of 'newdata' arguments. – IRTFM Feb 04 '15 at 05:37
  • As I mentioned in another comment, I need to preserve the ordering to be the same as that of the data frame so that I can do things like examine the ROC. – user1775655 Feb 04 '15 at 16:20

1 Answer

preds1 <- predict(glm1, DF[DF$col == condition, ])
preds2 <- predict(glm2, DF[DF$col != condition, ])

If you want them in the same vector, just use c().
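Note that c(preds1, preds2) stacks the two subsets, so the combined vector no longer follows DF's original row order. As a minimal sketch (assuming DF, glm1, glm2, and condition are defined as in the question), you can keep the original ordering by filling a preallocated vector with logical indexing instead of looping row by row:

# preallocate one slot per row of DF
preds <- numeric(nrow(DF))

# fill each subset in place; the row order of DF is preserved
preds[DF$col == condition] <- predict(glm1, DF[DF$col == condition, ])
preds[DF$col != condition] <- predict(glm2, DF[DF$col != condition, ])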

If you want to build a dataframe with the actual and predicted values stratified by condition, first make a structure that holds the 'actual' and condition variables. Some of these are not yet named or attached to any particular structure, so I will assume they live in a dataframe named DF with the column name "actual":

 compare.df <- data.frame(act = DF$actual, cond = DF$col, pred = NA)
 compare.df[DF$col == condition, 'pred'] <-
        predict(glm1, DF[DF$col == condition, ])
 compare.df[DF$col != condition, 'pred'] <-
        predict(glm2, DF[DF$col != condition, ])
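
Because compare.df keeps the rows in the same order as DF, the predicted and actual values line up and can be compared directly, e.g. for an ROC curve. One illustrative sketch, assuming the pROC package (any ROC routine would do) and that pred holds scores or fitted probabilities:

library(pROC)

# ROC of predictions against actual outcomes, in the original DF row order
roc_obj <- roc(compare.df$act, compare.df$pred)
auc(roc_obj)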
IRTFM
  • The main issue here is that if I want to compare the predicted values to the actual values, I've now lost the original ordering of DF. – user1775655 Feb 04 '15 at 13:50