0

I am doing research on applying a genetic algorithm to binary logistic regression. I have a few questions to be clarified. Can you please help me?

  1. Can I use AIC or BIC as the fitness function in the GA? (I used them and results show that GA is more accurate than the traditional binary logistic model. However, I found that in most of the papers, they used AUC as the fitness function)

  2. I tried GA using AUC as the fitness function according to this paper (http://atm.amegroups.com/article/view/18292/html) and it gives following error. Can you create a small reproducible example to overcome this problem?

Error in model.frame.default(formula = as.numeric(tey) ~ predict.glm(trm, : variable lengths differ (found for 'predict.glm(trm, newdata = ted, type = "response")')

5 Answers5

0

In the galgo package, the cost function can be custom-defined. Can you run the program as described in the paper? For example, you can define AUC as your goal; and you use neural network for the prediction the following code can help:

reg.fitness <- function(chr, parent,tr,te,res) {
  try <- as.factor(parent$data$classes[tr])
  trd <-
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm <- nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,
                    size = 5)
  tey <- as.factor(parent$data$classes[te])
  ted <-
    data.frame(parent$data$data[te,as.numeric(chr)])
  pred

 <- predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,
        levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}

you can adapt the model by modify this chunk: trm <- nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F, size = 5)

Z. Zhang
  • 637
  • 4
  • 16
  • @ChathuriAththanayake can you rebuild your fitnesses function with above code? – Z. Zhang Apr 08 '20 at 13:58
  • Thank you, @Z. Zhang I got this error (**Error in model.frame.default(formula = as.numeric(tey) ~ predict.glm(trm, : variable lengths differ (found for 'predict.glm(trm, newdata = ted, type = "response")')**) when I'm running the code according to the paper. However, I couldn't find the object 'pred' in the code for the neural network for the prediction. How can I adapt it? – Chathuri Aththanayake Apr 08 '20 at 14:03
  • I couldn't update the object called 'pred' in the above code. – Chathuri Aththanayake Apr 08 '20 at 14:07
  • @ Z. Zhang. Thank you very much. I was able to run the code. Unfortunately, the fitness value is 1 from the beginning. I tried this code for a different data set, it also found that the fitness function = 1 from the 1st iteration. This is the full code that I used for 3 independent categorical variables with binary dependent variable. – Chathuri Aththanayake Apr 10 '20 at 06:45
  • What’s the sample size ? – Z. Zhang Apr 10 '20 at 06:52
  • @ Z. Zhang total sample size = 498 and I took 331 for training – Chathuri Aththanayake Apr 10 '20 at 07:19
0

@Z. Zhang

reg.fitness = function(chr, parent,tr,te,res) {
  try=as.factor(parent$data$classes[tr])
  trd =
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,
                    size = 3)
  tey = as.factor(parent$data$classes[te])
  ted =
    data.frame(parent$data$data[te,as.numeric(chr)])
 pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}




reg.bb = configBB.VarSel(data=t(data_set[,-ncol(data_set)]), 
                          classes=data_set$y, 
                          classification.method="user", 
                          classification.userFitnessFunc=reg.fitness, 
                          chromosomeSize=3, niches=1, maxSolutions=10,
                          goalFitness = 0.9, saveVariable="reg.bb",
                          saveFrequency=50, saveFile="reg.bb.Rdata", 
                          main="Logistic")
blast(reg.bb)

This is the output that I got for first 4 iterations

[Bb] Starting, Solutions=10
[Bb]    #bb Sol Last    Fitness %Fit    Gen Time    Elapsed Total   Remaining

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 9s    *** 11  1   111.11% FINISH: 1 2 1 

[Bb]    1   1   Sol Ok  1   111.11% 11  9.33s   9s  10s 42s (0h 0m 42s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 1 1 3 

[Bb]    2   2   Sol Ok  1   111.11% 11  10.35s  20s 22s 50s (0h 0m 50s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 3 1 1 

[Bb]    3   3   Sol Ok  1   111.11% 11  9.93s   30s 34s 50s (0h 0m 50s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 1 2 2 

[Bb]    4   4   Sol Ok  1   111.11% 11  10s 40s 45s 45s (0h 0m 45)

Plot for all 1000 iterations where it gave same fitness function value as 1

0

@ Z. Zhang, here is a reproducible example of my code.

library(pROC)
library(galgo) 
library(rtkore)
library(Rcpp)
library(aod)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
attach(mydata)


reg.fitness = function(chr, parent,tr,te,res) {
  try=as.factor(parent$data$classes[tr])
  trd =
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,size = 2)
  tey = as.factor(parent$data$classes[te])
  ted =
    data.frame(parent$data$data[te,as.numeric(chr)])
 pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}




reg.bb = configBB.VarSel(data=t(mydata[,-ncol(mydata)]), 
                          classes=admit ,
                          classification.method="user", 
                          classification.userFitnessFunc=reg.fitness, 
                          chromosomeSize=2 ,niches=1, maxSolutions=10,
                          goalFitness = 0.9, saveVariable="reg.bb",
                          saveFrequency=50, saveFile="reg.bb.Rdata", 
                          main="Logistic")
blast(reg.bb)

Plot(blast(reg.bb))
0

The problem is that you use the admin to predict admin, which will of course get a AUC of 1. here is the revised code which can run expectedly in my computer.

library(pROC) library(galgo)
library(rtkore) library(Rcpp) library(aod) mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

reg.fitness = function(chr, parent,tr,te,res) {
try=as.factor(parent$data$classes[tr]) trd = data.frame(parent$data$data[tr,as.numeric(chr)]) trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,size = 2) tey = as.factor(parent$data$classes[te]) ted = data.frame(parent$data$data[te,as.numeric(chr)]) pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw") if(res){ roc(tey,pred,levels=levels(tey), direction = "<")$auc } else{ predict(trm,newdata=cbind(ted,tey=tey),type="class") } }

reg.bb = configBB.VarSel(data=t(mydata[,-1]), classes=mydata$admit , classification.method="user", classification.userFitnessFunc=reg.fitness, chromosomeSize=2 ,niches=1, maxSolutions=10, goalFitness = 0.9, saveVariable="reg.bb", saveFrequency=50, saveFile="reg.bb.Rdata", main="Logistic")

blast(reg.bb)

Plot(blast(reg.bb))

Z. Zhang
  • 637
  • 4
  • 16
0

for the ANN in the example, I use the nnet function, it has only one hidden layer, the input and output is determined by the data dimension. You can reference the manual help for the nnet function. Furthermore you can apply any structure of ANN by replace the function with any other functions

Z. Zhang
  • 637
  • 4
  • 16