
I am trying to write a parallelized loop that builds the best GLM by forward selection, at each step keeping only the variable with the lowest p-value, to model whether or not I am going to play tennis (yes/no as binary).

For example, I have a table (and a data frame built from it) of meteorological data. I construct the GLM by first finding which of these models has the lowest p-value:

PlayTennis ~ Precip
PlayTennis ~ Temp
PlayTennis ~ Relative_Humidity
PlayTennis ~ WindSpeed

Let's say PlayTennis ~ Precip has the lowest p-value. The next iteration of the repeat loop then checks which remaining variable gives the lowest p-value:

PlayTennis ~ Precip + Temp
PlayTennis ~ Precip + Relative_Humidity 
PlayTennis ~ Precip + WindSpeed

This continues until no remaining variable is significant (p-value greater than 0.05). We would then get, say, a final model of PlayTennis ~ Precip + WindSpeed (this is all hypothetical).

Is there any recommendation on how I can parallelize this code across several cores? I have come across speedglm, a drop-in replacement for glm from the speedglm package. It does improve things, but not by much. I also looked into a foreach loop, but I am not sure how the workers can communicate with each other to compare which p-value is lowest across the runs. Thank you in advance for any help.

d =

Time          Precip    Temp    Relative_Humidity   WindSpeed   …   PlayTennis    
1/1/2000 0:00   0        88           30                0              1    
1/1/2000 1:00   0        80           30                1              1    
1/1/2000 2:00   0        70           44                0              1    
1/1/2000 3:00   0        75           49               10              0    
1/1/2000 4:00   0.78     64           99               15              0    
1/1/2000 5:00   0.01     66           97               15              0    
1/1/2000 6:00   0        74           88                8              0    
1/1/2000 7:00   0        77           82                1              1    
1/1/2000 8:00   0        78           70                1              1    
1/1/2000 9:00   0        79           71                1              1
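For reference, the example table above can be reconstructed as an R data frame like this (all values copied from the table; the data is hypothetical, as noted):

```r
# Hypothetical example data, mirroring the table above
d <- data.frame(
  Time = as.POSIXct("2000-01-01 00:00", tz = "UTC") + 3600 * (0:9),
  Precip            = c(0, 0, 0, 0, 0.78, 0.01, 0, 0, 0, 0),
  Temp              = c(88, 80, 70, 75, 64, 66, 74, 77, 78, 79),
  Relative_Humidity = c(30, 30, 44, 49, 99, 97, 88, 82, 70, 71),
  WindSpeed         = c(0, 1, 0, 10, 15, 15, 8, 1, 1, 1),
  PlayTennis        = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
)
```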

The code that I have is as follows:

newNames <- setdiff(names(d), c("Time", "PlayTennis")) # candidate predictors only
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest coefficient in summary(GLM); row 1 is the intercept

repeat
{
    for (i in 1:length(newNames))
    {
        frm <- as.formula(paste(FRM, newNames[i], sep = ""))
        GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
                   data = d, family = binomial())
        # GLM <- speedglm(formula = frm, na.action = na.exclude,
        #                 data = d, family = binomial())

        temp <- coef(summary(GLM))[, 4][counter] # p-value of the newly added variable

        if (i == 1 || temp < MIN) # track the minimum p-value, its location, and variable name
        {
            MIN <- temp
            LOC <- i
            VAR <- newNames[i]
        }
    }

    if (MIN > 0.05) # stop once the best remaining variable is no longer significant
    {
        break
    }

    FRM <- paste(FRM, VAR, " + ", sep = "") # extend the formula with the accepted variable
    newNames <- newNames[newNames != VAR]   # drop the accepted variable from the candidates
    counter <- counter + 1

    if (length(newNames) == 0) break # no candidates left to try
}

The code that I have tried, but which is not working:

newNames <- names(d)
FRM <- "PlayTennis ~" 

repeat
{
    foreach (i = 1:length(newNames)) %dopar%
    {
        frm <- as.formula(paste(FRM, newNames[i], sep =""))
        GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
                    data = d, family = binomial())
        # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
        #                 data = d, family = binomial())

        temp <- coef(summary(GLM))[,4][counter]

        if (i == 1) # assign min p value, location, and variable name to the first iteration
        {
            MIN <- temp
            LOC <- i
            VAR <- newNames[i]
        }

        if (temp < MIN) # adjust the min p value accordingly
        {
            MIN <- temp
            LOC <- i
            VAR <- newNames[i]
        }
    }

    if(MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
    {
        break
    }

    FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
    newNames <- newNames[which(newNames != VAR)] # removes variable that is the most significant
    counter <- counter + 1
} 
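One way the workers could "communicate" is to not share state at all: each foreach iteration fits one candidate model and returns only its p-value, and the master process picks the minimum with `which.min` after the parallel section finishes. Below is a sketch of that idea, not a tested solution; it assumes a doParallel backend, and `counter` starts at 2 because row 1 of the coefficient table is the intercept:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

newNames <- setdiff(names(d), c("Time", "PlayTennis")) # candidate predictors only
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest coefficient; row 1 is the intercept

repeat
{
    # each worker fits one candidate model and returns that variable's p-value;
    # .combine = "c" collects the results into one numeric vector on the master
    pvals <- foreach(i = 1:length(newNames), .combine = "c") %dopar%
    {
        frm <- as.formula(paste(FRM, newNames[i], sep = ""))
        GLM <- glm(formula = frm, na.action = na.exclude,
                   data = d, family = binomial())
        coef(summary(GLM))[, 4][counter]
    }

    # the master, not the workers, compares the p-values
    LOC <- which.min(pvals)
    MIN <- pvals[LOC]
    VAR <- newNames[LOC]

    if (MIN > 0.05) break # best remaining variable is not significant

    FRM <- paste(FRM, VAR, " + ", sep = "")
    newNames <- newNames[newNames != VAR]
    counter <- counter + 1

    if (length(newNames) == 0) break
}

stopCluster(cl)
```

Because the workers never need to see each other's results, no cross-worker communication is required; the reduction happens serially on the master, where it is cheap compared to the model fits.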
  • `?step` or `?add1` will probably give speed gains before parallelising – user20650 Sep 08 '16 at 01:37
  • Where does this data come from? Is it built in to a package? – Hack-R Sep 08 '16 at 01:50
  • If you're trying to do variable selection -- don't. Or rather, don't do it this way. Use a regularised approach, like elastic net. You can use the `glmnet` package for this. – Hong Ooi Sep 08 '16 at 01:59
  • Or consider the `glmulti` package if you must. – Ben Bolker Sep 08 '16 at 02:25
  • @user20650, I have tried the step function, but it didn't lead to a model containing the significant variables of interest. That is why I programmed my own "manual" yet automatic way of finding the best model, but it is quite time consuming, and I want to see if I can spread it over several cores if possible. – lurodrig Sep 08 '16 at 19:54
  • @Hack-R, the data is confidential, so I made up a random example to demonstrate my objective with a data set that looks like the real data. – lurodrig Sep 08 '16 at 19:55
  • @HongOoi, glmnet looks good but takes over an hour to run. This is the first time I have heard of it, so I am learning as much as I can. In addition, I am trying to see which variables glmnet believes to be the most significant, but it doesn't seem to have that as an output. – lurodrig Sep 08 '16 at 19:57
  • @BenBolker, glmulti seems like a great package to use; however, my data set is quite large: 106596 x 137. So when I tried to run it, I came across a problem (oversized candidate). It seems I still need to build my list manually to input it to glmulti. – lurodrig Sep 08 '16 at 19:59
  • If you don't specifically need to use p-values, then try stepAIC in the MASS package. – Dominik Sep 09 '16 at 04:06
  • If the step routine doesn't keep any variables, then you should reconsider your candidate variables. But you could also use regsubsets to try every model combination and pick the best one. – Dominik Sep 09 '16 at 04:09
  • @Dominik, unfortunately I will need to use p-values. I tried stepAIC initially, and it didn't provide an accurate enough model in terms of significance, which is why I went to manually finding the optimal model. I tried regsubsets; however, it takes a long time to find an optimal model. Is there a way to parallelize it that you may know of? – lurodrig Sep 09 '16 at 20:03
  • I don't. Also, FYI, I'm fairly certain that of the parallel backends available, only with MPI can the slave processes talk to each other. – Dominik Sep 10 '16 at 22:23
  • @Dominik Awesome! Thank you for the suggestion. – lurodrig Sep 12 '16 at 20:44
