
I am very interested in a recent paper, "Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials" (DOI: 10.1002/sim.7064), and I want to reproduce the results in the section "Performance of the tree-based regression approach". However, the partitioning tree does not give the result I expect.

set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- runif(n)
t <- rbinom(n,1,0.5)
x3 <- rbinom(n,1,0.3)
x4 <- rnorm(n)
# subgroup-specific treatment effect: +t if x1 >= -0.3 & x2 >= 0.4, -t if x1 < -0.3 & x2 < 0.4
z <- 1 + 2*x1 - 1.8*x2 + (x1 >= -0.3)*(x2 >= 0.4)*t - (x1 < -0.3)*(x2 < 0.4)*t
pr = 1/(1+exp(-z))
y = rbinom(n,1,pr)
dt <- data.frame(x1,x2,x3,x4,t,y)

library(party)
mb <- mob(y~t-1|x1+x2+x3+x4,
  data=dt,
  model = glinearModel,
  family = binomial(),
  control=mob_control(minsplit=100))
plot(mb)

[figure: regression tree produced by plot(mb)]

The figure is shown above. It is supposed to split on x1 and x2 at the cutoff values of -0.3 and 0.4 defined in the simulation, but it does not appear to do so: x1 dominates the node partitioning, and x2 does not seem to be an important determinant of the splits. What's wrong with the code?

Z. Zhang
  • The global additive effects in x1 and x2 need to be accounted for and the tree you specified has no other option except through splits in the tree. One either should include x1 and x2 (and an intercept) in the regressor variables (as opposed to partitioning variables only) - or use the PALM tree extension of MOB (for Partially Additive Linear Model trees). I'll post a proper answer tomorrow when I'm in front of a computer. – Achim Zeileis Dec 02 '17 at 22:59
  • great I am looking forward to your answer – Z. Zhang Dec 02 '17 at 23:14

2 Answers


The model-based GLM tree you have specified tries to fit the following model: logit(prob) = alpha_tree(x1-x4) * t, where alpha_tree(x1-x4) is a subgroup-specific coefficient for the treatment indicator t, with subgroups obtained by partitioning on x1-x4. Thus, this model allows neither an intercept nor the global main effects of x1 and x2 that you simulated. As these main effects are quite pronounced, the model has no other possibility than to capture them by further splits; hence the large tree in your example.
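How pronounced these global main effects are can be seen from a plain logistic regression on the simulated data (just a quick check, not a tree model); both x1 and x2 should come out as highly significant:

# plain GLM with global main effects only (illustrative check)
summary(glm(y ~ x1 + x2 + t, data = dt, family = binomial))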

In the classic MOB framework (Zeileis et al. 2008, Journal of Computational and Graphical Statistics, doi:10.1198/106186008X319331), the only option would be to include every relevant regressor in the model to be partitioned, i.e., logit(mu) = beta0_tree(x1-x4) + beta1_tree(x1-x4) * x1 + beta2_tree(x1-x4) * x2 + alpha_tree(x1-x4) * t. This also works and will detect subgroups that differ only in the treatment effect alpha * t, but it of course loses some efficiency because it re-estimates the beta coefficients in every subgroup as well. A discussion of this approach specifically for subgroup analyses is available in Seibold et al. (2016a), The International Journal of Biostatistics, doi:10.1515/ijb-2015-0032.

Recently, we have suggested an adaptation of MOB that we call PALM tree, for partially additive linear model trees (Seibold et al. 2016b, http://arxiv.org/abs/1612.07498). This allows fitting models of the type logit(mu) = beta0 + beta1 * x1 + beta2 * x2 + alpha_tree(x1-x4) * t, as simulated in your question.

Implementations of both the classic GLM-based MOB tree and the PALM tree are available as glmtree() and palmtree(), respectively, in partykit, which is recommended over the old party implementation. Applying these to your simulated data yields the following. First, it is important that all categorical partitioning variables are flagged as factors (so that the right parameter instability tests are chosen):

dt <- transform(dt, x3 = factor(x3))

Then, we can fit the model you simulated from: a subgroup-specific treatment effect, global main effects of x1 and x2, and partitioning based on x1, x2, x3, and x4.

library("partykit")
tr1 <- palmtree(y ~ t - 1 | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
  family = binomial, minsize = 100)
tr1
## Partially additive generalized linear model tree (family: binomial)
## 
## Model formula:
## y ~ t - 1 | x1 + x2 + x3 + x4
## 
## Fitted party:
## [1] root
## |   [2] x1 <= -0.21797: n = 399
## |                t 
## |       -0.9245345 
## |   [3] x1 > -0.21797: n = 601
## |               t 
## |       0.6033979 
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2
## Number of parameters per node: 1
## Objective function (negative log-likelihood): 432.5717
## 
## Linear fixed effects (from palm model):
## (Intercept)          x1          x2 
##   0.7140397   1.7589675  -1.1335779 

Thus, this covers the most important parts of the data-generating process but fails to detect the interaction with x2.

plot(tr1)


I played around with other seeds and with BIC-based post-pruning (combined with a large significance level), which sometimes also recovered the interaction with x2. Presumably, a larger sample would yield the "true" tree more often as well. Thus, the approach is in principle capable of recovering the model you simulated, just not in this particular setting.
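A sketch of the kind of call I mean: a larger significance level (alpha) for the parameter instability tests combined with BIC-based post-pruning, both passed through to mob_control(). The object name and the particular values are only illustrative, not a recommendation:

# relax the pre-pruning significance level and post-prune by BIC (illustrative settings)
tr1b <- palmtree(y ~ t - 1 | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
  family = binomial, minsize = 100, alpha = 0.5, prune = "BIC")
plot(tr1b)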

Personally, I would always let both the intercept and the treatment effect vary across subgroups, because if there are any main effects I have overlooked, this is likely to yield a better model. If an intercept is included in every subgroup, it is also nicer to code both y and t as factors, which gives more readable plots in the visualization:

dt <- transform(dt, y = factor(y), t = factor(t))
tr2 <- palmtree(y ~ t | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
  family = binomial, minsize = 100)
tr2
## Partially additive generalized linear model tree (family: binomial)
## 
## Model formula:
## y ~ t | x1 + x2 + x3 + x4
## 
## Fitted party:
## [1] root
## |   [2] x1 <= -0.26515: n = 382
## |       (Intercept)          t1 
## |         0.9839353  -1.1376981 
## |   [3] x1 > -0.26515: n = 618
## |       (Intercept)          t1 
## |         0.5331386   0.6566111 
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2
## Number of parameters per node: 2
## Objective function (negative log-likelihood): 432.3303
## 
## Linear fixed effects (from palm model):
##        x1        x2 
##  1.964397 -1.078958 

For these data, this fits almost the same model as above, but the display is much easier to read, showing the large difference in absolute treatment effect between the two groups:

plot(tr2)


Finally, when using plain old MOB without the possibility of global additive main effects, we should include the regressors x1 and x2 in every subgroup model:

tr3 <- glmtree(y ~ t + x1 + x2 | x1 + x2 + x3 + x4, data = dt,
  family = binomial, minsize = 100)
tr3
## Generalized linear model tree (family: binomial)
## 
## Model formula:
## y ~ t + x1 + x2 | x1 + x2 + x3 + x4
## 
## Fitted party:
## [1] root
## |   [2] x1 <= -0.26515: n = 382
## |       (Intercept)          t1          x1          x2 
## |         0.5570219  -1.0511317   1.2533975  -1.3899679 
## |   [3] x1 > -0.26515: n = 618
## |       (Intercept)          t1          x1          x2 
## |         0.3573041   0.6943206   2.2910053  -0.9570403 
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2
## Number of parameters per node: 4
## Objective function (negative log-likelihood): 429.2406

Again, this essentially finds the same subgroups as before. However, it loses a bit of efficiency because more regression coefficients have to be estimated in each subgroup while only the t coefficient really changes between the subgroups.
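To see this redundancy directly, the per-node coefficients can be extracted with coef(), which for a glmtree()/mob() object should return one row per terminal node; the intercept, x1, and x2 columns are re-estimated in each subgroup even though essentially only the t coefficient differs:

# per-node coefficient matrix of the plain MOB tree
coef(tr3)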

plot(tr3, tp_args = list(which = 1))


Achim Zeileis
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- runif(n)
t <- rbinom(n,1,0.5)
x3 <- rbinom(n,1,0.3)
x4 <- rnorm(n)
z <-1+ 2*x1-1.8*x2+
  (x1>=-0.3)*(x2>=0.4)*t-
  (x1< -0.3)*(x2<0.4)*t
pr = 1/(1+exp(-z))
y = as.factor(rbinom(n,1,pr))
dt <- data.frame(x1,x2,x3=as.factor(x3),
  x4,t=as.factor(t),y)

The treatment effect should be statistically significant in the two covariate subspaces:

dt1 <- dt[x1>=-0.3&x2>=0.4,]
dt2 <- dt[x1< -0.3&x2<0.4,]
chisq.test(table(dt1$t,dt1$y))
## Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(dt1$t, dt1$y)
## X-squared = 11.684, df = 1, p-value = 0.0006305

prop.table(table(dt1$t,dt1$y),1)
chisq.test(table(dt2$t,dt2$y))
prop.table(table(dt2$t,dt2$y),1)
chisq.test(table(dt$t,dt$y))

I used the glmtree() function, and my result differs from your output: the interaction with x2 is correctly identified on the left side, but the right side fails to capture the partition on x2.

library(partykit)
tr1 <- glmtree(y~t+x1+x2 | x1+x2+x3+x4,
  data=dt,
  family = binomial())
plot(tr1,tp_args = list(which = 1))

[figure: fitted model tree produced by plot(tr1)]

Z. Zhang
  • You are correct that the treatment effect is statistically significant in the true groups. But this is not what MOB searches for. It tries to detect if the treatment effect is **significantly different** between subgroups. This is clearly the case for x1 but - conditional on the split in x1 - the treatment subgroup interaction for x2 is rather small. Using the true splits: `summary(glm(y ~ x1 + x2 + factor(x2 < 0.4) * t, data = dt, subset = x1 < -0.3, family = binomial))`, i.e., a p-value of 2.83% for only the interaction. In a search over variables and splits, this becomes non-significant. – Achim Zeileis Dec 04 '17 at 10:33
  • Conversely, in `summary(glm(y ~ x1 + x2 + factor(x2 >= 0.4) * t, data = dt, subset = x1 >= -0.3, family = binomial))` the treatment subgroup interaction has a p-value of 1.58%. Again, searching over more variables and/or more split points and/or assessing interactions with other coefficients as well may lead to non-significant results and no splitting. – Achim Zeileis Dec 04 '17 at 12:00