I am trying to write a parallelized for loop that builds the best GLM one variable at a time, adding at each step only the variable with the lowest p-value, in order to predict whether or not I am going to play tennis (a binary yes/no response).
For example, I have a table (held in a data frame) of meteorological data. I start building the GLM by checking which of these single-variable models gives the lowest p-value:
PlayTennis ~ Precip
PlayTennis ~ Temp
PlayTennis ~ Relative_Humidity
PlayTennis ~ WindSpeed
Let's say PlayTennis ~ Precip has the lowest p-value. The next iteration of the repeat loop then checks which additional variable has the lowest p-value:
PlayTennis ~ Precip + Temp
PlayTennis ~ Precip + Relative_Humidity
PlayTennis ~ Precip + WindSpeed
This continues until no remaining variable is significant (p-value greater than 0.05). We might thus end up with a final model of PlayTennis ~ Precip + WindSpeed (this is all hypothetical).
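To make one step of this procedure concrete, here is a minimal sketch of how the candidate formulas for the second iteration could be generated. It assumes Precip was already selected; the names `selected`, `candidates`, and `formulas` are only illustrative. `reformulate()` from the stats package builds each formula from term labels, which avoids pasting strings by hand.

```r
# Hypothetical state after the first iteration: Precip has been selected
response   <- "PlayTennis"
selected   <- c("Precip")
candidates <- c("Temp", "Relative_Humidity", "WindSpeed")

# One formula per remaining candidate, each extending the current model
formulas <- lapply(candidates, function(v) {
  reformulate(c(selected, v), response = response)
})

for (f in formulas) print(f)
```

Each of these formulas would then be fitted, and the candidate with the lowest p-value kept.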
Is there any recommendation on how I can parallelize this code across multiple cores? I came across speedglm, a faster alternative to glm from the speedglm library. It does improve the run time, but not by much. I also looked into a foreach loop, but I am not sure how the iterations running on different threads can communicate so that I know which p-value is the lowest across the runs. Thank you in advance for any help.
d =
Time Precip Temp Relative_Humidity WindSpeed … PlayTennis
1/1/2000 0:00 0 88 30 0 1
1/1/2000 1:00 0 80 30 1 1
1/1/2000 2:00 0 70 44 0 1
1/1/2000 3:00 0 75 49 10 0
1/1/2000 4:00 0.78 64 99 15 0
1/1/2000 5:00 0.01 66 97 15 0
1/1/2000 6:00 0 74 88 8 0
1/1/2000 7:00 0 77 82 1 1
1/1/2000 8:00 0 78 70 1 1
1/1/2000 9:00 0 79 71 1 1
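For anyone who wants to reproduce this, the ten rows shown above could be entered as a data frame roughly as follows. This is only a sketch: the columns hidden behind the "…" are left out, and parsing Time as POSIXct in UTC is an assumption about how the column is stored.

```r
# The ten rows from the table above; the elided ("…") columns are not included
d <- data.frame(
  Time = as.POSIXct(sprintf("2000-01-01 %02d:00", 0:9),
                    format = "%Y-%m-%d %H:%M", tz = "UTC"),  # assumed type
  Precip            = c(0, 0, 0, 0, 0.78, 0.01, 0, 0, 0, 0),
  Temp              = c(88, 80, 70, 75, 64, 66, 74, 77, 78, 79),
  Relative_Humidity = c(30, 30, 44, 49, 99, 97, 88, 82, 70, 71),
  WindSpeed         = c(0, 1, 0, 10, 15, 15, 8, 1, 1, 1),
  PlayTennis        = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
)

str(d)
```

With only these ten rows the GLM fits are not meaningful (some predictors separate the response perfectly), so the sample is useful for checking that code runs rather than for judging p-values.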
The code that I have is as follows:
# Candidate predictors: every column except the response
# (Time is dropped too, assuming the timestamp is not meant to be a predictor)
newNames <- setdiff(names(d), c("PlayTennis", "Time"))
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest term in the coefficient table (row 1 is the intercept)
repeat
{
  for (i in seq_along(newNames))
  {
    frm <- as.formula(paste(FRM, newNames[i], sep = ""))
    GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
               data = d, family = binomial())
    # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
    #                 data = d, family = binomial())
    temp <- coef(summary(GLM))[, 4][counter] # p-value of the newly added variable
    if (i == 1 || temp < MIN) # track the min p value, its location, and the variable name
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
  }
  if (MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
  {
    break
  }
  FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
  newNames <- newNames[which(newNames != VAR)] # removes variable that is the most significant
  counter <- counter + 1
  if (length(newNames) == 0) # stop once every variable has been added
  {
    break
  }
}
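One way to restructure a single pass of the loop above is to have each fit return its p-value and pick the minimum afterwards, instead of updating MIN inside the loop. The sketch below does this serially with `vapply()`. It runs on synthetic stand-in data (`toy`, not the weather table), and `candidate_p` is a hypothetical helper name. It also looks the p-value up by coefficient name rather than by position, so no counter is needed.

```r
# Synthetic stand-in data (not the real weather table): y depends on x1 only
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1))

# Hypothetical helper: Wald p-value of `candidate` when added to `selected`
candidate_p <- function(candidate, selected, response, data) {
  frm <- reformulate(c(selected, candidate), response = response)
  fit <- glm(frm, data = data, family = binomial())
  coef(summary(fit))[candidate, "Pr(>|z|)"]  # look up by name, not position
}

# One selection step: a p-value per candidate, then the smallest one
candidates <- c("x1", "x2", "x3")
p <- vapply(candidates, candidate_p, numeric(1),
            selected = character(0), response = "y", data = toy)
best <- names(p)[which.min(p)]
print(p)
print(best)
```

Because every fit is independent and simply returns a number, this shape carries over directly to a parallel map: nothing has to be shared between iterations.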
Code that I have tried, but which is not working:
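# Candidate predictors: every column except the response
# (Time is dropped too, assuming the timestamp is not meant to be a predictor)
newNames <- setdiff(names(d), c("PlayTennis", "Time"))
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest term in the coefficient table (row 1 is the intercept)
repeat
{
  # Not working: with a parallel backend, MIN, LOC, and VAR are assigned
  # on the workers and never reach this session
  foreach (i = seq_along(newNames)) %dopar%
  {
    frm <- as.formula(paste(FRM, newNames[i], sep = ""))
    GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
               data = d, family = binomial())
    # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
    #                 data = d, family = binomial())
    temp <- coef(summary(GLM))[, 4][counter] # p-value of the newly added variable
    if (i == 1) # assign min p value, location, and variable name to the first iteration
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
    if (temp < MIN) # adjust the min p value accordingly
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
  }
  if (MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
  {
    break
  }
  FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
  newNames <- newNames[which(newNames != VAR)] # removes variable that is the most significant
  counter <- counter + 1
  if (length(newNames) == 0) # stop once every variable has been added
  {
    break
  }
}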
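A possible way around the communication problem is to not share MIN between iterations at all: each `foreach` iteration returns its own p-value, `.combine = c` gathers them into a vector, and the minimum is taken in the main session. The following is a minimal sketch under several assumptions: it uses synthetic data (`toy`) instead of the weather table, a two-worker cluster from the doParallel package, and illustrative names such as `selected` and `remaining`.

```r
library(foreach)
library(doParallel)

# Synthetic stand-in data (not the real weather table): y depends on x1 and x3
set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.0 * toy$x3))

cl <- makeCluster(2)      # two workers, chosen arbitrarily for the sketch
registerDoParallel(cl)

response  <- "y"
selected  <- character(0)                 # variables chosen so far
remaining <- setdiff(names(toy), response)

while (length(remaining) > 0) {
  # Each iteration returns one p-value; results come back in candidate order
  p <- foreach(v = remaining, .combine = c) %dopar% {
    frm <- reformulate(c(selected, v), response = response)
    fit <- glm(frm, data = toy, family = binomial())
    coef(summary(fit))[v, "Pr(>|z|)"]     # p-value of the newly added term
  }
  if (min(p) > 0.05) break                # no remaining candidate is significant
  best      <- remaining[which.min(p)]
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
}

stopCluster(cl)
print(reformulate(selected, response = response))
```

The comparison of p-values happens only after all fits of a step have returned, so the workers never need to talk to each other. For fits as small as these, the cost of sending data to the workers may outweigh the gain; parallelizing is more likely to pay off when each individual GLM fit is slow.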