I want to know what variables are important in my decision tree model.
I got the model with train() from the caret package, but the attribute-usage results are strange for factor variables.
Below is my code.
set.seed(123)
ctrl <- trainControl(method = "cv", classProbs = TRUE, summaryFunction = twoClassSummary)
mDt <- train(metS ~ ., data = df_train, method = "C5.0", metric = "ROC", trControl = ctrl); mDt
I obtained the attribute usage with C5imp(); the results from summary(mDt) were the same.
C5imp(mDt$finalModel)
The attribute usage results are as follows:
- age 100.00
- BMI 100.00
- height 100.00
- weight 100.00
- job7 98.90
- piHeatScore 83.81
- dailyAlcoholIntake_final 82.96
- pi4.L 67.14
- familyIncome^9
- pi17.C 60.33
- pi6.C 59.72
- pi13.L 56.53
- ...
The strange thing is that a single ordered factor (e.g. 'pi4': Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<"5") gets multiple attribute-usage entries ('pi4.L', 'pi4.Q', 'pi4.C', 'pi4^4').
It's similar for unordered factors. For example, 'marriage' is a factor w/ 6 levels ("1","2","3","4","5","6"), and the attribute usages are shown for 'marriage2', 'marriage3', 'marriage4', 'marriage5', and 'marriage6'.
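These suffixes look exactly like what model.matrix() produces in base R: polynomial contrasts (.L, .Q, .C, ^4, ...) for ordered factors and treatment dummies (2, 3, ...) for unordered ones. A small self-contained demo (variable names f_ord/f_unord are mine, not from my data):

```r
# Ordered factor with 5 levels -> polynomial contrast columns .L .Q .C ^4
f_ord <- factor(1:5, levels = 1:5, ordered = TRUE)
colnames(model.matrix(~ f_ord))
# "(Intercept)" "f_ord.L" "f_ord.Q" "f_ord.C" "f_ord^4"

# Unordered factor with 6 levels -> treatment dummy columns 2..6
f_unord <- factor(1:6)
colnames(model.matrix(~ f_unord))
# "(Intercept)" "f_unord2" "f_unord3" "f_unord4" "f_unord5" "f_unord6"
```

So the column names in my C5imp() output match what a formula expansion of the predictors would create.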
However, the results should look like the following, obtained by calling C5.0() directly on the same data (one attribute-usage entry per factor variable):
mTemp <- C5.0(df_train[,-1], df_train$metS)
C5imp(mTemp)
- BMI 100.00
- age 32.37
- pi6 27.28
- pi13 16.92
- pi9 15.76
- job 9.07
- pi14 2.88
- ...
I suspect this is caused by a difference between C5.0() and train(): as far as I can tell, the formula interface of train() expands factor predictors into contrast/dummy columns before fitting, while C5.0() keeps each factor as a single attribute.
I want to keep using train() from the caret package because it automatically applies cross-validation etc.
How can I get one attribute-usage value per factor variable while still using train()?
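If it helps, this is the non-formula call I am considering as a workaround. As I understand it, passing x and y directly skips the model.matrix() expansion, so the factors should reach C5.0 intact (a sketch with my own variable names, assuming metS is column 1 of df_train; untested):

```r
library(caret)  # train(), trainControl()
library(C50)    # C5imp()

set.seed(123)
ctrl <- trainControl(method = "cv", classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Non-formula interface: predictors as a data frame, outcome as a factor,
# so factor columns are passed to C5.0 without dummy coding.
mDt2 <- train(x = df_train[, -1], y = df_train$metS,
              method = "C5.0", metric = "ROC", trControl = ctrl)

C5imp(mDt2$finalModel)
```

Is this the right way to do it, or is there a better option?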