
I used the gbm function to implement gradient boosting, and I want to perform classification. After fitting, I used the varImp() function to print variable importance, but only 4 variables have non-zero importance, out of 371 variables in my data set. Is that right? Here are my code and results.

> asd <- read.csv("bigdatafile.csv", header = TRUE)
> asd1 <- gbm(TARGET ~ ., n.trees = 50, distribution = "adaboost", verbose = TRUE, interaction.depth = 1, data = asd)

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
 1        0.5840             nan     0.0010    0.0011
 2        0.5829             nan     0.0010    0.0011
 3        0.5817             nan     0.0010    0.0011
 4        0.5806             nan     0.0010    0.0011
 5        0.5795             nan     0.0010    0.0011
 6        0.5783             nan     0.0010    0.0011
 7        0.5772             nan     0.0010    0.0011
 8        0.5761             nan     0.0010    0.0011
 9        0.5750             nan     0.0010    0.0011
10        0.5738             nan     0.0010    0.0011
20        0.5629             nan     0.0010    0.0011
40        0.5421             nan     0.0010    0.0010
50        0.5321             nan     0.0010    0.0010

>varImp(asd1,numTrees = 50)
                    Overall
CA0000801           0.00000
AS0000138           0.00000
AS0000140           0.00000
A1                  0.00000
PROFILE_CODE        0.00000
A2                  0.00000
CB_thinfile2        0.00000
SP_thinfile2        0.00000
thinfile1           0.00000
EW0001901           0.00000
EW0020901           0.00000
EH0001801           0.00000
BS_Seg1_Score       0.00000
BS_Seg2_Score       0.00000
LA0000106           0.00000
EW0001903           0.00000
EW0002801           0.00000
EW0002902           0.00000
EW0002903           0.00000
EW0002904           0.00000
EW0002906           0.00000
LA0300104_SP       56.19052
ASMGRD2          2486.12715
MIX_GRD          2211.03780
P71010401_1         0.00000
PS0000265           0.00000
P11021100           0.00000
PE0000123           0.00000

There are 371 variables, so I did not list the rest of them above; they all have zero importance.

TARGET is the target variable and has two levels, so I used the adaboost distribution. I grew 50 trees.

Is there a mistake in my code? Very few variables have non-zero importance.

Thank you for your reply.

colorlace
이순우
  • All depends on the data and nothing here shows that this might not be correct. Four features in your data can correctly classify target. That's why all others have zero importance. – discipulus Feb 15 '17 at 03:25
  • I agree with @discipulus. The model selected those variables to predict the outcome. You can try and tune the hyperparameters to see if the variable importance changes. You can force the model to consider other variables if you take these 4 variables out of the data. Maybe try "Bernoulli" or "Binomial" distribution if your target is binary. – syebill Feb 15 '17 at 09:00

2 Answers


You cannot use importance() or varImp() here; those are for random forests.

However, you can use summary.gbm from the gbm package.

Ex:

summary.gbm(boost_model)

The output is a table of each variable and its relative influence, sorted in decreasing order (the original answer showed a screenshot, omitted here).
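Since the screenshot is not available, here is a minimal, self-contained sketch of the same idea; the simulated data and the model name boost_model are hypothetical, for illustration only:

```r
# Fit a small gbm on simulated data and inspect relative influence.
# Assumes the gbm package is installed.
library(gbm)

set.seed(1)
n <- 500
df <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)          # irrelevant predictor
)
df$TARGET <- as.integer(df$x1 + df$x2 > 0)  # binary 0/1 target

boost_model <- gbm(TARGET ~ ., data = df,
                   distribution = "bernoulli",
                   n.trees = 100, shrinkage = 0.1)

# summary() on a gbm object returns a data frame with columns
# var and rel.inf, sorted by decreasing relative influence;
# plotit = FALSE suppresses the default bar plot.
imp <- summary(boost_model, plotit = FALSE)
print(imp)
```

Variables that the trees never split on (like noise here, typically) get a relative influence of exactly zero, which is the same behavior the question observed.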

UseR10085
J. G.B.

In your code, n.trees is very low and shrinkage is also very low (0.001). Adjust these two parameters.

  1. n.trees is the number of trees. Increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
  2. interaction.depth (maximum nodes per tree) is the number of splits performed on each tree (starting from a single node).
  3. shrinkage acts as a learning rate. Shrinkage is also used in ridge regression, where it pulls regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients. I recommend 0.1 for data sets with more than 10,000 records. Also, use a small shrinkage when growing many trees.

If you set n.trees = 1000 and shrinkage = 0.1, you will get different values. And if you want to know the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). varImp() is a fine function, but here I recommend summary.gbm().
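A sketch of the suggested tuning; since the question's bigdatafile.csv is not available, the data below is simulated for illustration (assuming the gbm package is installed):

```r
library(gbm)

# Hypothetical stand-in for the question's data: 5 predictors, 0/1 target.
set.seed(42)
n <- 1000
asd <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
asd$TARGET <- as.integer(asd$V1 - asd$V2 > 0)

# Many more trees and a larger learning rate than the original call
# (n.trees = 50, shrinkage = 0.001 by default).
asd1 <- gbm(TARGET ~ ., data = asd,
            distribution = "adaboost",   # adaboost expects a 0/1 outcome
            n.trees = 1000,
            shrinkage = 0.1,
            interaction.depth = 1,
            verbose = FALSE)

# Relative influence of every variable, largest first:
imp <- summary(asd1, plotit = FALSE)
head(imp)
```

With more boosting iterations and a larger step size, the model gets many more chances to split on weaker predictors, so more variables tend to end up with non-zero relative influence.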

Good luck.

서영재