I'm trying to get the error rates for a Naive Bayes classifier as I add variables incrementally. For example, my dataset has 25 variables, and I want the error rate of the model as each variable is added: the error rate with the first 2 columns, then with the first 3 columns, then with the first 4, and so on up to the last column.

Here is the pseudocode for what I'm trying to achieve:

START
IMPORT DATASET WITH ALL VARIABLES

num_variables = num_dataset_cols
i= 1

WHILE (i < num_variables)
{
   CREATE NEW DATASET WITH THE FIRST i + 1 COLUMNS

   BUILD THE MODEL 
   GET THE ERROR RATE

   ADD IN NEXT COLUMN

   i = i + 1
}

Here is a reproducible example. Obviously you can't build an NB classifier with this data, but that's not the problem; the problem is adding the columns one at a time. So far, the only approach I have found overwrites the previous column on each iteration. For an NB classifier the first column is the class node, so the model needs at least 2 columns to start with.

#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")

dataset <- data.frame(col1, col2, col3, col4,col5)

num_variables <- ncol(dataset)

i <- 1

while i <= num_variables 
{
data <- dataset[c(1, i+1)]
str(data)

#BUILD MODEL AND GET VALIDATION ERROR

#INCREMENT i TO GET NEXT COLUMN
i <- i + 1

}

You should be able to see from the str(data) output that the column is overwritten each time. Does anyone know how I could add each column without overwriting the previous one? Someone suggested an array, but I'm not too familiar with arrays in R. Would that work?

Eoin

3 Answers

1

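I think this is what you want. Indexing with `1:i` selects columns 1 through i, so each pass keeps the previous columns and adds the next one. In the output below, the first iteration (i = 1) returns a single column as a plain vector rather than a data frame, and each `NULL` is just the return value of `str()` being printed by `print()`.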

col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")

dataset <- data.frame(col1, col2, col3, col4,col5)
dataset

num_variables <- ncol(dataset)
num_variables
i <- 1

while (i <= num_variables) {

data <- dataset[, 1:i]

print(str(data))

#BUILD MODEL AND GET VALIDATION ERROR

#INCREMENT i TO GET NEXT COLUMN
i <- i + 1

}

Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame':   5 obs. of  2 variables:
 $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ col2: num  1 2 3 4 5
NULL
'data.frame':   5 obs. of  3 variables:
 $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ col2: num  1 2 3 4 5
 $ col3: logi  TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame':   5 obs. of  4 variables:
 $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ col2: num  1 2 3 4 5
 $ col3: logi  TRUE FALSE FALSE TRUE FALSE
 $ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame':   5 obs. of  5 variables:
 $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ col2: num  1 2 3 4 5
 $ col3: logi  TRUE FALSE FALSE TRUE FALSE
 $ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
 $ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
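
As a variation on the loop above, here is a minimal sketch that starts at two columns (the class column plus one predictor, which is what the question asks for) and keeps every subset in a list so nothing is overwritten. The list name `subsets` is my own choice for illustration, and `drop = FALSE` is only there as a safeguard so the result stays a data frame even if the loop is ever started at one column.

```r
dataset <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  col2 = c(1, 2, 3, 4, 5),
  col3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  col4 = c("n", "y", "y", "n", "y"),
  col5 = c("10", "15", "50", "100", "20")
)

num_variables <- ncol(dataset)

# 'subsets' is a hypothetical name: one slot per model, from 2 columns up to all columns
subsets <- vector("list", num_variables - 1)

for (i in 2:num_variables) {
  # columns 1..i; drop = FALSE guarantees a data.frame is returned
  data <- dataset[, 1:i, drop = FALSE]
  subsets[[i - 1]] <- data

  # BUILD MODEL AND GET VALIDATION ERROR here
}

# number of columns in each stored subset
sapply(subsets, ncol)
```

The last line should report 2, 3, 4 and 5 columns, one entry per model.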
Luis Candanedo
  • Thank you! This works great! It seems like the only significant change you made was adding in : to the line `data <- dataset[, 1:i]`. Does that add each column one by one? – Eoin Jan 26 '16 at 14:15
  • You are welcome. Yes, that's the main change. Also don't forget the parentheses () in the while statement. – Luis Candanedo Jan 26 '16 at 14:19
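
For anyone wondering what the `:` changes, here is a small sketch of the difference between the two index expressions. The value `i <- 3` is just an arbitrary example.

```r
i <- 3

c(1, i + 1)   # two positions only: 1 and 4
1:(i + 1)     # the full range: 1 2 3 4
1:i + 1       # precedence trap: ':' binds tighter than '+', giving 2 3 4
```

So `dataset[c(1, i + 1)]` picks exactly two columns each time, while `dataset[, 1:i]` picks a growing block of columns.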
0

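You can use the append function after defining an output variable before the loop. Note that append returns a new object, so the result has to be assigned back:

output <- list()  # define once, before the loop

data <- dataset[c(1, i+1)]
output <- append(output, list(data))
str(data)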
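
Put together with the loop from the question, this might look like the sketch below. It is only an illustration: the three-column `dataset` is a cut-down stand-in, and the loop condition uses `<` rather than `<=` so that `i + 1` never runs past the last column.

```r
# cut-down stand-in for the dataset in the question (assumption)
dataset <- data.frame(
  col1 = c("A", "B", "C"),
  col2 = 1:3,
  col3 = c(TRUE, FALSE, TRUE)
)

output <- list()                 # must exist before the loop
i <- 1

while (i < ncol(dataset)) {      # stop before i + 1 exceeds the column count
  data <- dataset[c(1, i + 1)]   # class column plus one predictor, as in the question
  # append() returns a new list rather than modifying 'output', so assign it back
  output <- append(output, list(data))
  i <- i + 1
}

length(output)
```

Wrapping `data` in `list()` matters here: without it, `append()` would add the data frame's columns as separate list elements instead of adding the data frame as one element.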
Szymon Roziewski
0

Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:

dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0("errorrate", i)
assign(name, dataset$errorrate)

...

This should leave you with i variables, each containing the error estimate for one model run. If you are looking for one parameter estimate per model, you can assign each single estimate a unique name in the global environment using the process above and then rbind them together after the loop has finished.
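
A minimal sketch of that pattern, with the caveat that the error values are placeholders (`i / 10`) standing in for whatever your model actually returns, and that `num_models` and `all_rates` are names I made up for the example:

```r
num_models <- 4

for (i in seq_len(num_models)) {
  # placeholder for a real error-rate calculation (assumption, not model output)
  errorrate <- i / 10

  # creates errorrate1, errorrate2, ... in the current environment
  assign(paste0("errorrate", i), errorrate)
}

# gather the individually named variables back into one named vector
all_rates <- unlist(mget(paste0("errorrate", seq_len(num_models))))
all_rates
```

If you don't need the separate variable names, writing into a pre-allocated vector or list inside the loop is usually simpler than `assign` followed by `mget`.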

Derek Darves
  • Thanks for your help! I don't think this will work for me, because the package I'm using to get the cv error rate doesn't give me a single value I can assign to a vector. However it should work great for my logistic regression model! – Eoin Jan 26 '16 at 14:21