partykit minsize option drops branches that exceed minsize

Question

I'm using the lmtree() function from partykit to partition data using linear regressions. The regressions use a weight, and I want to ensure that each branch has a minimum total weight, which I specify with the minsize option. For instance, in the following example the tree only has two branches instead of three because x1=="C" has too small a weight to be in its own branch.

library("partykit")
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1="A", weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="B", weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="C", weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
X$x1 <- factor(X$x1)
tr <- lmtree(y ~ TT | x1, data=X, weights=weight, minsize=150)

Fitted party:
[1] root
|   [2] x1 in A: n = 200
|       (Intercept)          TT 
|         0.7724903   0.2002023 
|   [3] x1 in B, C: n = 300
|       (Intercept)          TT 
|         0.5759213   0.4659592

I also have some real-world data that unfortunately is confidential but is leading to some behavior that I do not understand. When I do not specify minsize it builds a tree with 30 branches, where in every branch the total weight n is a large number. However, when I specify a minsize that is well below the total weight of every branch from this first tree the result is a new tree with many fewer branches. I would not have expected the tree to change at all because it seems that minsize is not binding. Is there any explanation for this result?

UPDATE

Providing an example

n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
tr <- lmtree(y ~ TT | x1, data=X, weights = weight)

Fitted party:
[1] root
|   [2] x1 <= 0.29787: n = 200
|       (Intercept)          TT 
|         0.8431985   0.1994021 
|   [3] x1 > 0.29787
|   |   [4] x1 <= 0.69515: n = 200
|   |       (Intercept)          TT 
|   |         0.6346980   0.3995678 
|   |   [5] x1 > 0.69515: n = 100
|   |       (Intercept)          TT 
|   |         0.4792462   0.5987472

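Now let's set minsize=150. The tree no longer has any splits, even though a single split at x1 = 0.3 would be admissible: the branch x1 <= 0.3 has a total weight of 200 and the branch x1 > 0.3 has a total weight of 300.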

tr <- lmtree(y ~ TT | x1, data=X, weights = weight, minsize=150)

Fitted party:
[1] root: n = 500
    (Intercept)          TT 
      0.6870078   0.3593374

Achim Zeileis · Accepted Answer · 2017-07-30T00:28:51.917

Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from more explicit discussion:

If mob() selects a splitting variable at any stage that then does not lead to a single admissible split (in terms of minimal node size), then splitting stops at that point. This is in contrast to ctree() which always performs a split if a significant test was detected - even if the second-best variable was non-significant. It would probably be good to offer more granular control over this - and we have it on our wishlist for the upcoming revision of the package.
By default the weights are interpreted as case weights, i.e., mob() thinks that there were w independent observations identical to the given one. Thus, the number of observations is the sum of weights. But note that this also affects the significance tests for which the sample size increases!
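
To illustrate the case-weight interpretation in the second rule, here is a minimal base-R sketch that uses plain lm() rather than mob(). The toy data frame d and its columns are made up for illustration. It shows that the sample size under case weights is the sum of the weights, and that a weighted fit matches an unweighted fit on the expanded data in its coefficients but not in its degrees of freedom.

```r
set.seed(1)
# Toy data (hypothetical): 10 rows with integer case weights 1 or 2
d <- data.frame(x = 1:10, w = rep(c(1, 2), 5))
d$y <- 1 + 0.5 * d$x + rnorm(10, sd = 0.2)

# Under the case-weight interpretation the sample size is the sum of weights
n_case <- sum(d$w)

# Expand the data: repeat each row as many times as its weight
d_exp <- d[rep(seq_len(nrow(d)), d$w), ]

# Weighted fit on the original rows vs. unweighted fit on the expanded rows
fit_w <- lm(y ~ x, data = d, weights = w)
fit_e <- lm(y ~ x, data = d_exp)

same_coef <- isTRUE(all.equal(coef(fit_w), coef(fit_e)))
c(rows = nrow(d), n_case = n_case, expanded_rows = nrow(d_exp))
same_coef
c(df_weighted = df.residual(fit_w), df_expanded = df.residual(fit_e))
```

The coefficients agree, but lm() bases the residual degrees of freedom on the number of rows rather than on the sum of the weights. A significance test that treats the weights as case weights therefore has to account for the larger sample size explicitly.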

As for your main question: It's hard to come up with an explanation without a reproducible example. I agree that partykit should behave the way you describe, but maybe there is an important but not so obvious detail that you haven't noticed yet. It would be good if you could come up with a small, simple artificial data set that replicates the problem.

Update

As already pointed out in the comments: Thanks for the reproducible example in your updated question. This helped me track down a bug in mob() in handling case weights. There was an error in the computation of the test statistic in the presence of case weights, thus leading to incorrect split variable selection and stopping criterion. I have just fixed this bug and the new partykit development version is available from R-Forge at https://r-forge.r-project.org/R/?group_id=261. (Note, however, that R-Forge at the moment only builds Windows binaries for R 3.3.x. If a more recent Windows version is used, please use type = "source" to install the source package - and make sure you have the necessary Rtools installed.)

In your example I just set a random seed for exact reproducibility. The weighted data is set up as:

set.seed(1)
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)

Then the weighted tree can be fitted as before. In this particular example the tree structure remains unaffected, but the test statistics and p-values of the parameter instability tests in each node change somewhat:

library("partykit")
tr1 <- lmtree(y ~ TT | x1, data = X, weights = weight)
plot(tr1)

Adding the minsize = 150 argument now has the expected effect of just avoiding the split in node 3.

tr2 <- lmtree(y ~ TT | x1, data = X, weights = weight, minsize = 150)
plot(tr2)

To check that the latter actually does the right thing, we compare it with the tree for the explicitly expanded data. As the weights are regarded as case weights here, we can inflate the data set by repeating those observations with weights greater than 1.

Xw <- X[rep(1:nrow(X), X$weight), ]
tr3 <- lmtree(y ~ TT | x1, data = Xw, minsize = 150)

The resulting coefficients are the same (up to very small numerical differences):

all.equal(coef(tr2), coef(tr3))
## [1] TRUE

And, more importantly, all test statistics and p-values in the nodes are also the same:

library("strucchange")
all.equal(sctest(tr2), sctest(tr3))
## [1] TRUE

Hi Achim, changing minsize changes the test statistic for the parameter instability test, which I can see when setting verbose=TRUE. Inside of the function `mob_grow_fluctests` there is a line `from <- max(from, minsize)` which is causing this. Can you elaborate on how this would lead to the pruning of branches in the original tree, which was the one with no minsize but every branch showing n > minsize? — Abiel, Jul 26 '17 at 20:26
Ah, good point, I didn't think of this yesterday. If you split along a numeric variable, then the split variable selection by default assesses all potential splits between 10% and 90% of the splitting variable. However, if `minsize` is larger (say 20%) than 10% of the data, then the range is adapted accordingly. If the true split is indeed between 20% and 80% of the data, this will increase power. However, if the true split is between 10% and 20% (or 80% and 90%), you will lose power. However, please double check whether everything is sound here with your non-standard use of weights. — Achim Zeileis, Jul 26 '17 at 20:55
I've added an update to the original post based on the comment. — Abiel, Jul 27 '17 at 21:19
Thanks, this was helpful. It is a bug in the handling of case weights for `mob()`. It often has no dramatic consequences (and hence probably was undiscovered for so long) but can have as your example shows. I'll investigate this in more detail and then update my answer. — Achim Zeileis, Jul 28 '17 at 12:29
The fixed version of `partykit` is now available from R-Forge, see my updated reply for the details. Thx! — Achim Zeileis, Jul 30 '17 at 00:29
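
The effect of minsize on the search range described in the comments above can be sketched with a little arithmetic. This is a simplified illustration and not the actual partykit code: the helper search_range() and its trim argument are hypothetical, the symmetric upper bound is an assumption, and only the max(from, minsize) step is taken from the comment thread. A minsize of 0 stands in for a value that is not binding.

```r
# Hypothetical helper: range of candidate split positions (counted in
# observations) after trimming the outer tails and enforcing minsize.
search_range <- function(n, minsize = 0, trim = 0.1) {
  from <- ceiling(n * trim)     # default trimming: skip the outer 10%
  from <- max(from, minsize)    # step quoted in the comments
  to <- n - from                # symmetric upper bound (assumption)
  c(from = from, to = to)
}

n <- 500                        # total case weight in the updated example
r_default <- search_range(n)                  # minsize not binding
r_minsize <- search_range(n, minsize = 150)   # minsize = 150
r_default
r_minsize
```

With n = 500, minsize = 150 narrows the range from 10%-90% to 30%-70% of the total weight. The test statistic is then computed over fewer candidate breakpoints, so it can change even when every branch of the unrestricted tree already exceeds minsize.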
