
I'm trying to fit a regression tree to a dataset using caret's train() function. The dataset had numeric variables, which I converted to categorical (factor) variables while trying to get rid of an error message. I'm also using trainControl(), again in an attempt to resolve the error. Any help is appreciated.

library(caret)
library(rpart)
library(mlbench)
data(Dataset)
set.seed(1)
ctrl <- trainControl(method = "cv", savePredictions = TRUE)
model_T <- train(VALUE ~ REF_DATE + Sex + `Age at admission` + `Years since admission` + `Income type` + Statistics + UOM, data = Dataset, method = 'rpart2', trControl = ctrl)
model_T

The structure of Dataset:

spec_tbl_df [46,464 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ REF_DATE             : Factor w/ 11 levels "2006","2007",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex                  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Age at admission     : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
$ Years since admission: Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Income type          : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Statistics           : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
$ UOM                  : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ VALUE                : num [1:46464] 154640 145895 151290 155340 169745 ...
Progman

1 Answer

The error is caused by the spaces in the column names. With the backticked names in the formula, train() fails:

library(caret)
library(rpart)
library(mlbench)
ctrl <- trainControl(method = "cv",
                     savePredictions = TRUE)
model_T <- train(VALUE ~ REF_DATE + Sex + `Age at admission` + `Years since admission` + `Income type` + Statistics + UOM,
                 data = Dataset, method = 'rpart2', trControl = ctrl)
#Error in `[.data.frame`(m, labs) : undefined columns selected 
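
As a quick check before fitting, one way to see which columns are likely to trigger this is to compare each name with its syntactically valid form. This is only a sketch in base R: the `cols` vector is a hypothetical stand-in that copies the column names from the question, and with the real data you would use `names(Dataset)` instead.

```r
# Hypothetical stand-in for names(Dataset), copied from the question
cols <- c("VALUE", "REF_DATE", "Sex", "Age at admission",
          "Years since admission", "Income type", "Statistics", "UOM")

# make.names() returns a syntactically valid version of each name;
# any name it has to change is non-syntactic (here: the ones with spaces)
non_syntactic <- cols[cols != make.names(cols)]
print(non_syntactic)
```

Any name listed here has to be wrapped in backticks inside a formula, which is the situation that fails above.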

If we use a dataset with clean names (e.g. with the spaces replaced by underscores), it should work. Here, clean_names() from the janitor package does the renaming:

library(janitor)
Dataset2 <- clean_names(Dataset)
names(Dataset2)
#[1] "value"                 "ref_date"              "sex"                   "age_at_admission"      "years_since_admission" "income_type"           "statistics"            "uom"    
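
If installing janitor is not an option, a similar renaming can be sketched in base R. This is a simplified assumption-laden version, not a reimplementation of clean_names(): the small data frame `df` is a hypothetical example, and the regular expression only lowercases the names and collapses runs of non-alphanumeric characters into underscores (it does not handle duplicates or names that start with a digit).

```r
# Hypothetical toy data with non-syntactic column names
df <- data.frame(VALUE = c(1.5, 2.5),
                 `Age at admission` = factor(c(1, 2)),
                 `Income type` = factor(c(3, 4)),
                 check.names = FALSE)  # keep the spaces in the names

# Replace every run of non-alphanumeric characters with "_" and lowercase
renamed <- df
names(renamed) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(renamed)))
print(names(renamed))
```

The renamed columns can then be used in a formula without backticks.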

Now create the model:

model_T2 <- train(value ~ ref_date + sex + age_at_admission + years_since_admission + income_type + statistics + uom,
                  data = Dataset2, method = 'rpart2', trControl = ctrl)

Output:

> model_T2
CART 

200 samples
  7 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  maxdepth  RMSE       Rsquared    MAE      
  1         0.9669617  0.03721968  0.7642369
  2         0.9674085  0.02626375  0.7656366
  6         1.0268165  0.03139845  0.8033324

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 1.

Data:

library(tibble)  # provides tibble()
set.seed(123)
Dataset <- tibble(VALUE = rnorm(200), REF_DATE = factor(rep(c(2006, 2007), each = 100)), Sex = factor(sample(1:4, size = 200, replace = TRUE)),
                  `Age at admission` = factor(sample(1:4, size = 200, replace = TRUE)),
                  `Years since admission` = factor(sample(1:11, size = 200, replace = TRUE)), 
                  `Income type` = factor(sample(1:6, size = 200, replace = TRUE)),
                  Statistics = factor(sample(1:4, size = 200, replace = TRUE)),
                  UOM = factor(sample(1:2, size = 200, replace = TRUE))
                  )
akrun