
I am trying to use model-based recursive partitioning (MOB) with the mob() function (from the partykit package) to obtain the parameters associated with each feature in every subgroup of the optimal partition, using logistic regression (glm with family = binomial). For this I had to define my own model-fitting function.

Following the example on page 7 of https://cran.r-project.org/web/packages/partykit/vignettes/mob.pdf I created a logit function that fits the model and returns the estimates of the logistic regression. However, my definition of the function does not seem to be correct.

library(partykit)
logit_func <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}

p <- mob(future ~ ., data = sample, fit = logit_func)

... and I am getting the following error:

Error in model.frame.default(formula = y ~ 0 + x, drop.unused.levels = TRUE) : 
  invalid type (NULL) for variable 'x' 

The sample dataframe is the following:

sample <- structure(list(future = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("0", "1"), class = "factor"), HHk = c(0.412585987717856, 
1, 1, 1, 1, 1, 1, 1, 0.865684350743137, 0.685221125225357), HHd = c(0.529970735028671, 
1, 1, 1, 0.611295754192343, 0.171910197073699, 0.722887386610618, 
0.457585763978574, 0.517888089662373, 0.401285262785306), via_4 = structure(c(1L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), 
    region_5 = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L), .Label = c("0", "1"), class = "factor")), row.names = c(NA, 
10L), class = "data.frame")

Any clue?

Thank you :)

vog

2 Answers


Apparently, the problem is related to the formula argument of partykit::mob. I don't know which model you have in mind, but you did not specify any partitioning variable (z). The following works, but does not find any breaks, though; I assume that is because of how small the data set is.

The call below fits a model where HHk is your regressor and HHd is used as a partitioning variable.

p <- mob(formula = future ~ HHk | HHd,
         data = sample,
         fit = logit_func)

# Model-based recursive partitioning (logit_func)
# 
# Model formula:
#   future ~ HHk | HHd
# 
# Fitted party:
#   [1] root: n = 10
# x(Intercept)         xHHk 
# -1.386266     2.006611  
# 
# Number of inner nodes:    0
# Number of terminal nodes: 1
# Number of parameters per node: 2
# Objective function: 6.557608
  • Actually, in the case with only a single formula part on the right-hand side _all_ variables are used as partitioning variables (and none are used as regressors). See `?mob`, especially the description of the `formula` argument and the "Details" section. – Achim Zeileis Jan 09 '21 at 00:06
    Thank you for this tip, @AchimZeileis! I didn't know that, quite a useful feature. – Álvaro A. Gutiérrez-Vargas Jan 11 '21 at 07:01

In your mob() call, the formula only has a single part on the right-hand side, of type y ~ z, as opposed to a two-part right-hand side of type y ~ x | z. The z variables are the ones used for splitting/partitioning in the tree, and the x variables are the ones used as regressors in the model. (As already pointed out in the response by Álvaro.)

In principle, it is fine not to have any regressors, you can simply use a constant fit (i.e., intercept only model). However, the logit_func() you defined does not catch this case. There are three ways to remedy this:

  1. Catch the case if(is.null(x)) inside logit_func() and then use glm(y ~ 1, ...).

  2. Keep logit_func() as it is, and specify the regression on the intercept explicitly: mob(future ~ 1 | ., data=sample, fit = logit_func).

  3. Use the dedicated glmtree() function rather than the general mob() plus hand-crafted logit_func(): glmtree(future ~ ., data = sample, family = binomial).
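Strategy 1, for instance, could be sketched as follows. This is a minimal modification of your original logit_func() (weights and offset are still accepted but ignored, exactly as in your original):

```r
library(partykit)

logit_func <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  if (is.null(x)) {
    ## no regressors specified: mob() passes x = NULL,
    ## so fall back to an intercept-only model
    glm(y ~ 1, family = binomial, start = start, ...)
  } else {
    glm(y ~ 0 + x, family = binomial, start = start, ...)
  }
}

## now the original call with a single right-hand side works
p <- mob(future ~ ., data = sample, fit = logit_func)
```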

All three will lead to the same tree but Strategy 3 is strongly preferred for a number of reasons: (a) It is readily available and does not require creating custom code. (b) The fitting function used internally is computationally more efficient (e.g., avoids repetitive formula parsing etc.). (c) There are better methods available for the resulting tree, e.g., a nicer plot() and more options in the predict() method.

Additionally, it might make sense to consider some of the explanatory variables as regressors and others as splitting variables (as suggested by Álvaro). But this depends on the data and the application case and it's hard to make recommendations without further context.
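As an illustration of such a mixed specification (the choice of HHk as the regressor here is purely hypothetical and only for demonstration):

```r
library(partykit)

## Hypothetical split of the explanatory variables: HHk as regressor,
## the remaining variables as partitioning variables.
p2 <- glmtree(future ~ HHk | HHd + via_4 + region_5,
              data = sample, family = binomial)
```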

The results on your sample data are shown below. Of course, on this small data set no splits are found but on the full data set it should hopefully work as expected.

p <- glmtree(future ~ ., data = sample, family = binomial)
p
## Generalized linear model tree (family: binomial)
## 
## Model formula:
## future ~ 1 | .
## 
## Fitted party:
## [1] root: n = 10
##     (Intercept) 
##       0.4054651  
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
## Number of parameters per node: 1
## Objective function (negative log-likelihood): 6.730117
Achim Zeileis