
I am using the R programming language. I am learning how to loop a procedure and store the results in a table. For this example, I first generated some data:

#load libraries    
library(caret)
library(rpart)
        
#generate data
a = rnorm(1000, 10, 10)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)

group <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.5,0.5) )
group_1 <- 1:1000

#put data into a frame
d = data.frame(a,b,c, group, group_1)

d$group = as.factor(d$group)

Then, I create the final table where I want the results from the loop to be stored:

#create the final results table in which the results of the loop will be stored
final_table = matrix(1, nrow = 6, ncol=2)

Here is the procedure that I want to loop. I want to fit a decision tree to this data six times. On each pass, the response variable "group_1" is recoded as a factor: "1" if "group_1 > i" and "0" otherwise, where "i" takes the six values 400, 401, 402, 403, 404, 405. I want to store the accuracy of each of these decision trees in "final_table":

for (i in 400:405) 
{
  d$group_1 = ifelse(d$group_1 > i, "1","0")
  d$group_1 = as.factor(d$group_1)
  
  
  trainIndex <- createDataPartition(d$group_1, p = .8,
                                    list = FALSE,
                                    times = 1)
  training = d[ trainIndex,]
  test  <- d[-trainIndex,]
  
  
  fitControl <- trainControl(## 10-fold CV
    method = "repeatedcv",
    number = 10,
    ## repeated ten times
    repeats = 10)
  
  TreeFit <- train(group_1 ~ ., data = training,
                   method = "rpart2",
                   trControl = fitControl)
  
  pred = predict(TreeFit, test, type = "prob")
  labels = as.factor(ifelse(pred[,2]>0.5, "1", "0"))
  con = confusionMatrix(labels, test$group_1)
  
  #update results into table
  row = i - 399
  final_table[row,1] = con$overall[1]
  final_table[row,2] = i
}

However, this gives me the following error and warning:

Error in na.fail.default(list(group = c(2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,  : 
  missing values in object
In addition: Warning message:
In Ops.factor(d$group_1, i) : ‘>’ not meaningful for factors

Can someone please tell me what I am doing wrong?

Thanks

Sinh Nguyen
stats_noob
    The first two lines of your loop convert `group_1` into a factor of `1` and `0` based on how its values compare to `i`. The first iteration is fine, but by the second iteration the original `1:1000` values have already been replaced by that factor, which causes the error you got. – Sinh Nguyen Jan 30 '21 at 02:14
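The behaviour described in this comment can be reproduced on a toy vector. The following is only a minimal sketch: `g`, `first_pass`, and `cmp` are hypothetical names standing in for `d$group_1` and its intermediate values, and the thresholds 4 and 5 are arbitrary.

```r
# Toy stand-in for d$group_1 (hypothetical values, not the question's data)
g <- 1:10

# Iteration 1: g is still numeric, so the comparison works as intended
g <- as.factor(ifelse(g > 4, "1", "0"))
first_pass <- as.character(g)

# Iteration 2: g is now a factor, so `>` returns NA for every element
# (R also warns that '>' is not meaningful for factors)
cmp <- suppressWarnings(g > 5)
g <- as.factor(ifelse(cmp, "1", "0"))

all(is.na(cmp))   # TRUE
all(is.na(g))     # TRUE: the response is now entirely missing
```

A response that is entirely `NA` is what the "missing values in object" error is complaining about when the model is fitted.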

1 Answer


Keep a copy of the original data frame in another variable and use it to reset `d` at the start of every iteration, so that `group_1` is numeric again before each comparison.

library(caret)
library(rpart)

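# keep an untouched copy; run this while d$group_1 is still numeric
e <- d

for (i in 400:405) {
  d <- e  # restore the original data at the start of each iteration
  d$group_1 = as.integer(d$group_1 > i)
  d$group_1 = as.factor(d$group_1)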
  
  trainIndex <- createDataPartition(d$group_1, p = .8,list = FALSE,times = 1)
  training = d[ trainIndex,]
  test  <- d[-trainIndex,]
  
  
  fitControl <- trainControl(## 10-fold CV
    method = "repeatedcv",
    number = 10,
    ## repeated ten times
    repeats = 10)
  
  TreeFit <- train(group_1 ~ ., data = training,
                   method = "rpart2",
                   trControl = fitControl)
  
  pred = predict(TreeFit, test, type = "prob")
  labels = as.factor(ifelse(pred[,2]>0.5, "1", "0"))
  con = confusionMatrix(labels, test$group_1)
  
  #update results into table
  row = i - 399
  final_table[row,1] = con$overall[1]
  final_table[row,2] = i
  
}

final_table
#      [,1] [,2]
#[1,] 0.585  400
#[2,] 0.618  401
#[3,] 0.598  402
#[4,] 0.608  403
#[5,] 0.533  404
#[6,] 0.570  405
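An alternative to restoring a copy is to never overwrite `group_1` in the first place. The sketch below is one possible way to do that, not the method used above: it wraps one iteration in a hypothetical helper `fit_one()`, builds the response in a new column `y`, and collects one row per threshold. To keep it quick, it fits the tree with `rpart::rpart` directly instead of `caret::train`, uses a plain random 80/20 split, and runs on a smaller made-up data set with thresholds 100 to 105; all of these are assumptions made for illustration.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Small stand-in for the question's data frame
n <- 200
d <- data.frame(a = rnorm(n, 10, 10),
                b = rnorm(n, 10, 5),
                group_1 = 1:n)

fit_one <- function(i, data) {
  # Build the response in a local copy; the caller's data is untouched
  data$y <- factor(as.integer(data$group_1 > i), levels = c(0, 1))
  data$group_1 <- NULL  # drop the raw column so it is not used as a predictor

  # Simple random 80/20 split (not stratified like createDataPartition)
  idx   <- sample(nrow(data), size = floor(0.8 * nrow(data)))
  train <- data[idx, ]
  test  <- data[-idx, ]

  fit  <- rpart(y ~ ., data = train, method = "class")
  pred <- predict(fit, test, type = "class")

  data.frame(threshold = i, accuracy = mean(pred == test$y))
}

# One row per threshold, bound into a single results table
results <- do.call(rbind, lapply(100:105, fit_one, data = d))
results
```

Because `fit_one()` only modifies its own local copy, `d$group_1` stays numeric after the loop, so the code can be rerun without recreating the data.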
Ronak Shah
  • Thank you for your answer! Instead of `(i in 400:405)`, can I use `(i in sample(100:400, 10))`? I got the following error: `Error in na.fail.default(list(group_1 = c(NA_integer_, NA_integer_, NA_integer_, : missing values in object`. Thank you for all your help. – stats_noob Jan 30 '21 at 05:30
  • Yes, but then the line `row = i - 399` will fail. – Ronak Shah Jan 30 '21 at 06:22
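One way to handle arbitrary thresholds is to index the results table by loop position instead of deriving the row from `i`. The sketch below assumes a pristine copy `e` whose `group_1` is still numeric; the `NA_integer_` values in the error above suggest (this is a guess) that `e` was copied after `group_1` had already been turned into a factor by an earlier run. `thresholds`, `k`, and `fit_and_score()` are hypothetical names, and `fit_and_score()` is only a placeholder for the train/predict/`confusionMatrix` steps.

```r
set.seed(42)  # arbitrary seed for a repeatable draw

# Pristine copy: group_1 must still be numeric at this point
e <- data.frame(group_1 = 1:1000)
stopifnot(is.numeric(e$group_1))

thresholds <- sample(100:400, 10)

# Size the table from the number of thresholds, not from their values
final_table <- matrix(NA_real_, nrow = length(thresholds), ncol = 2,
                      dimnames = list(NULL, c("accuracy", "threshold")))

# Hypothetical placeholder for the model-fitting steps:
# returns the accuracy of always predicting the majority class
fit_and_score <- function(d) {
  max(prop.table(table(d$group_1)))
}

for (k in seq_along(thresholds)) {
  i <- thresholds[k]
  d <- e                                   # restore the original data
  d$group_1 <- as.factor(as.integer(d$group_1 > i))
  final_table[k, "accuracy"]  <- fit_and_score(d)   # row k, whatever i is
  final_table[k, "threshold"] <- i
}

final_table
```

Using `seq_along(thresholds)` keeps the row index between 1 and the number of thresholds, so the table fills correctly whether the thresholds are consecutive, sampled, or unsorted.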