
I am using partykit and noticed a possible varid mismatch (unless I have misunderstood something). The example code is below.

The root node as returned by nodeapply shows variable 5 as the split variable.

Also, the first element of the explicitly generated list has split$varid 5. However, in the iris data frame the 5th column is Species. Petal.Width is the 4th column, so I would expect 4 to be the varid for the root node, given the split shown by the j48_party object.

It seems that each varid is the column index of the feature actually used, plus 1. Is this intentional?

> library(partykit)
> library(RWeka)
> data("iris")
> j48 <- J48(Species~., data=iris)
> j48_party <- as.party(j48)
> j48_party

Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

Fitted party:
[1] root
|   [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
|   [3] Petal.Width > 0.6
|   |   [4] Petal.Width <= 1.7
|   |   |   [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
|   |   |   [6] Petal.Length > 4.9
|   |   |   |   [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
|   |   |   |   [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
|   |   [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)

Number of inner nodes:    4
Number of terminal nodes: 5
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
> nodeapply(j48_party)
$`1`
[1] root
|   [2] V5 <= 0.6 *
|   [3] V5 > 0.6
|   |   [4] V5 <= 1.7
|   |   |   [5] V4 <= 4.9 *
|   |   |   [6] V4 > 4.9
|   |   |   |   [7] V5 <= 1.5 *
|   |   |   |   [8] V5 > 1.5 *
|   |   [9] V5 > 1.7 *

> nodes <- as.list(j48_party$node)
> nodes[[1]]$split$varid
[1] 5
Achim Zeileis
Krrr

1 Answer


The difference arises because J48(), like most other modeling functions (such as lm() or glm()), does not use the supplied data directly but first builds a model.frame. This step already carries out variable transformations (e.g., taking logs, creating factors or Surv() objects), collects variables that might not be in data but in the calling environment, and leaves out variables that are not in the model formula. See ?model.frame for further information and links.

Therefore, the object created by J48() has a model.frame that is not identical to the iris data: the response variable has been moved to the first column, so every regressor sits one position later than in iris:

head(model.frame(j48))
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4

This column layout is carried over to the party object, which is why varid 5 refers to Petal.Width:

j48_party$data
## [1] Species      Sepal.Length Sepal.Width  Petal.Length Petal.Width 
## <0 rows> (or 0-length row.names)

[Note: In the case of J48(), the party object only stores the meta-information and drops the actual data because it is not needed here. This is different for ctree(), for example.]
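
The lookup that resolves the mismatch in the question can be sketched in base R alone. This is a minimal sketch, assuming that model.frame() applied to the same formula reproduces the column order shown above for model.frame(j48). The names `mf`, `varid`, `split_var` and `orig_col` are illustrative and not part of partykit.

```r
# Rebuild the model frame with base R; assumed to match model.frame(j48) above.
mf <- model.frame(Species ~ ., data = iris)

# The split variable ID reported for the root node in the question.
varid <- 5L

# A varid indexes the columns of the model frame, not of the original data.
split_var <- names(mf)[varid]
print(split_var)

# Position of the same variable in the original iris data frame.
orig_col <- match(split_var, names(iris))
print(orig_col)
```

Here `split_var` is "Petal.Width" and `orig_col` is 4, in line with the printed tree. With the fitted object from the question, the same lookup should be possible as `names(j48_party$data)[nodes[[1]]$split$varid]`, since j48_party$data keeps the column names of the model frame.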

To see that this model.frame() can differ from the original data, consider the following situation: we create a new noise variable that is not part of iris but only exists in the calling environment, take logs of Petal.Width, and omit several variables:

set.seed(1) 
noise <- rnorm(150)
j48 <- J48(Species ~ log(Petal.Width) + noise, data = iris)
j48_party <- as.party(j48)
head(model.frame(j48))
##   Species log(Petal.Width)      noise
## 1  setosa       -1.6094379 -0.6264538
## 2  setosa       -1.6094379  0.1836433
## 3  setosa       -1.6094379 -0.8356286
## 4  setosa       -1.6094379  1.5952808
## 5  setosa       -1.6094379  0.3295078
## 6  setosa       -0.9162907 -0.8204684
j48_party$data
## [1] Species          log(Petal.Width) noise           
## <0 rows> (or 0-length row.names)
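
Because of such transformed or external variables, a varid cannot always be translated back to a column of the original data. A small base R sketch of that check follows. It assumes, as before, that model.frame() on the same formula gives the columns shown above, and `mf2` and `positions` are illustrative names.

```r
set.seed(1)
noise <- rnorm(150)

# Same formula as above, built with base R only.
mf2 <- model.frame(Species ~ log(Petal.Width) + noise, data = iris)

# Look up each model-frame column in the original data; NA means no such column.
positions <- match(names(mf2), names(iris))
names(positions) <- names(mf2)
print(positions)
```

Only Species maps to a column of iris (the 5th). The other two entries are NA, so variable names are best looked up in the model frame rather than in the original data.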
Achim Zeileis