
I used the gbm() function to create the model and I want to get the accuracy. Here is my code:

df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)

str(df)

F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])

library(caret)

set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]

install.packages("gbm")
library("gbm")

df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
                 shrinkage=0.01, data=train)
summary(df_boosting)

yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2) 

However, when using the summary function, an error appears. The error message is as follows.

Error in plot.window(xlim, ylim, log = log, ...) : 
  need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

When I compute the MSE with the mean function, the following warning also appears:

Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
  ‘-’ not meaningful for factors

Do you know why these two errors appear? Thank you in advance.

신익수

1 Answer


The problem in your code is the definition of the binary response variable Creditability. Column 1 is included in F, so the loop converts it to a factor, but gbm needs a numeric response variable (coded 0/1 for distribution = "bernoulli").
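As a small illustration of why the factor response causes trouble, here is a base R sketch on a toy vector. The name `creditability` is a hypothetical stand-in for the real column, not the actual data. It shows that arithmetic on a factor yields NA, and that a factor of 0/1 labels is best converted back through as.character().

```r
# Toy stand-in (hypothetical) for the Creditability column after as.factor().
creditability <- factor(c(1, 0, 1, 1, 0))

# Arithmetic on a factor is undefined: R returns NA and raises a warning.
bad <- suppressWarnings(creditability - 1)

# as.numeric() alone returns the internal level codes, not the 0/1 labels.
codes <- as.numeric(creditability)

# Going through as.character() recovers the original 0/1 values.
y <- as.numeric(as.character(creditability))

print(bad)
print(codes)
print(y)
```

If the column has already been converted, the same as.numeric(as.character(...)) idiom should bring it back to numeric 0/1 before calling gbm.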

Here is the corrected code, with column 1 left out of the factor conversion:

df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)

F <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
str(df)

Creditability is now a binary numeric variable:

'data.frame':   1000 obs. of  21 variables:
 $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance                  : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
 $ Purpose                          : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
 ...

... and the remaining part of the code works nicely:

library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]

library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli", 
       n.trees=100, verbose=TRUE, interaction.depth=4,
       shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)


##########
                                                                var    rel.inf
Account.Balance                                     Account.Balance 36.8578980
Credit.Amount                                         Credit.Amount 12.0691120
Duration.of.Credit..month.               Duration.of.Credit..month. 10.5359895
Purpose                                                     Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit  9.1296524
Value.Savings.Stocks                           Value.Savings.Stocks  4.9620662
Instalment.per.cent                             Instalment.per.cent  3.3124252
...
##########

yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2) 

[1] 0.2719788
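Note that predict() on a gbm object returns values on the link (log-odds) scale by default; passing type = "response" returns probabilities instead. To get a classification accuracy in percent rather than an MSE, one possible approach is sketched below in base R. The vectors `link_scores` and `actual` are made-up values standing in for the gbm predictions and for test$Creditability, and the 0.5 cut-off is an assumption.

```r
# Hypothetical log-odds, standing in for
# predict(df_boosting, newdata = test, n.trees = 100).
link_scores <- c(1.2, -0.4, 0.3, 2.0, -1.5, 0.1)

# Hypothetical true labels, standing in for test$Creditability (numeric 0/1).
actual <- c(1, 0, 0, 1, 0, 1)

# The inverse logit turns log-odds into probabilities of class 1.
prob <- plogis(link_scores)

# Classify at a 0.5 threshold (an assumption; other cut-offs are possible).
pred_class <- as.integer(prob > 0.5)

# Confusion matrix and accuracy as a percentage.
conf_mat <- table(predicted = pred_class, actual = actual)
accuracy_pct <- 100 * mean(pred_class == actual)

print(conf_mat)
print(accuracy_pct)
```

With the real model, the plogis() step can be skipped by requesting type = "response" in predict().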

Hope this can help you.

Marco Sandri
  • Why should I change the type of the Creditability variable? It is a factor variable consisting of 0 and 1. Also, is there a way to get the accuracy as a percentage instead of the MSE? Or is MSE the only way to measure accuracy? – 신익수 May 31 '17 at 13:51
  • @신익수 I changed `Creditability` from factor to numeric only because it is a requirement of `gbm`. I did not consider the method that you used for calculating the predictive performance of `gbm`. In any case, MSE is not an appropriate measure here. I suggest using, for example, a method based on the ROC curve. – Marco Sandri May 31 '17 at 14:02
  • @Marco Sandri So, to run gbm in R, do I have to make the target (dependent) variable numeric rather than categorical? But the data is for classification, not regression. – 신익수 Jun 05 '17 at 03:59
  • Using the option `distribution = "bernoulli"`, `gbm` knows that the response variable needs to be treated as a binary categorical factor. – Marco Sandri Jun 05 '17 at 09:13
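The ROC-based evaluation suggested in the comments can be summarised by the area under the curve (AUC). The following is a minimal base R sketch using the rank-sum (Mann-Whitney) identity. The function name `auc_rank` and the vectors `prob` and `actual` are hypothetical illustrations, not part of gbm; dedicated packages such as pROC offer fuller ROC tooling.

```r
# Hypothetical predicted probabilities and true 0/1 labels.
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c(1,   1,   0,   1,   1,   0,   0,   0)

# AUC through the Mann-Whitney rank-sum identity:
# the share of (positive, negative) pairs where the positive scores higher.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(prob, actual)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, so unlike the MSE it does not depend on the scale of the predictions.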