
I have a dataset where observations are taken repeatedly at 72 different sites. I'm using an lmertree model from the glmertree package with a random intercept, a treatment variable, and many "partitioning" variables to identify whether there are clusters of sites whose relationship between the treatment and response variables differs depending on the partitioning variables. In most cases the model does not split observations from the same site among different terminal nodes, but in a few cases it does.

Is there some way to constrain the model to ensure that non-independent observations are included in the same terminal node?

Nick_89

1 Answer


The easiest way to do so is to use only site-level partitioning variables. If such a variable is not constant across the observations at a site, it may make sense to average it for each site. For example, to average a regressor x2 into a site-level mean x2m in data d:

d$x2m <- tapply(d$x2, d$site, mean)[d$site]
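As a self-contained illustration (with invented toy values for d$site and d$x2, mirroring the snippet above): tapply() computes one mean per site, and indexing the result with the site factor expands it back to one value per observation.

```r
## Toy data: 3 sites with repeated observations of an observation-level regressor x2
d <- data.frame(
  site = factor(c("A", "A", "B", "B", "C", "C")),
  x2   = c(1, 3, 10, 20, 5, 5)
)

## Per-site means, expanded back to observation level via the site factor
d$x2m <- tapply(d$x2, d$site, mean)[d$site]

d$x2m  # 2 2 15 15 5 5 -- constant within each site
```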

If you have additional variables at observation level rather than at site level, it may make sense to include them either (a) in the regression part of the formula, so that the corresponding coefficients are node-specific in the tree, or (b) in the random-effect part of the formula, so that only a single global coefficient is estimated. For example, if you have a single observation-level regressor z and two site-level regressors x1 and x2 that you want to use for partitioning, you might consider

## (a)
y ~ treatment + z | site | x1 + x2
## (b)
y ~ treatment | (1 | site) + z | x1 + x2
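A hedged sketch of fitting the two variants on simulated data (all variable names here are invented for illustration; the lmertree() calls are only run if the glmertree package is installed):

```r
set.seed(1)

## Simulate 20 sites with 10 observations each (toy data, invented names)
n_site <- 20; n_obs <- 10
d <- data.frame(
  site      = factor(rep(seq_len(n_site), each = n_obs)),
  treatment = rnorm(n_site * n_obs),
  z         = rnorm(n_site * n_obs),               # observation-level regressor
  x1        = rep(rnorm(n_site), each = n_obs),    # site-level regressor
  x2        = rep(rnorm(n_site), each = n_obs)     # site-level regressor
)
## Treatment effect depends on x1; random intercept per site
d$y <- 1 + 2 * d$treatment * (d$x1 > 0) + 0.5 * d$z +
  rep(rnorm(n_site, sd = 0.5), each = n_obs) + rnorm(n_site * n_obs)

if (requireNamespace("glmertree", quietly = TRUE)) {
  library("glmertree")
  ## (a) z in the regression part: node-specific coefficient for z
  tree_a <- lmertree(y ~ treatment + z | site | x1 + x2, data = d)
  ## (b) z in the random-effect part: one global coefficient for z
  tree_b <- lmertree(y ~ treatment | (1 | site) + z | x1 + x2, data = d)
}
```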

Finally, we recently discovered that in the case of cluster-level (i.e., site-level) covariates combined with strong random effects, it may make sense to initialize the estimation of the model with the random effect rather than with the tree. The reason is simple: if estimation starts with the tree, the tree will capture the random intercepts through splits in the cluster-level variables. We plan to adapt the package accordingly but haven't done so yet. However, it is easy to do yourself. You simply start by estimating the null model with only the random intercept:

null_lmm <- lmer(y ~ (1 | site), data = d)

Then you extract the predicted random intercepts and include them in your data:

d$ranef <- ranef(null_lmm)$site[[1]][d$site]

And include it as the starting value for the random effect:

tree_lmm <- lmertree(y ~ treatment | site | x1 + x2 + ...,
  data = d, ranefstart = d$ranef, ...)
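Putting the three steps together on hypothetical simulated data (the model-fitting part is only executed if lme4 and glmertree are available; all variable names are invented):

```r
set.seed(2)

## Toy data (invented names): 15 sites with 8 observations each
n_site <- 15; n_obs <- 8
d <- data.frame(
  site      = factor(rep(seq_len(n_site), each = n_obs)),
  treatment = rnorm(n_site * n_obs),
  x1        = rep(rnorm(n_site), each = n_obs),
  x2        = rep(rnorm(n_site), each = n_obs)
)
d$y <- 1 + d$treatment + rep(rnorm(n_site), each = n_obs) + rnorm(n_site * n_obs)

if (requireNamespace("lme4", quietly = TRUE) &&
    requireNamespace("glmertree", quietly = TRUE)) {
  library("lme4")
  library("glmertree")

  ## Step 1: null model with only the random intercept
  null_lmm <- lmer(y ~ (1 | site), data = d)

  ## Step 2: expand the predicted random intercepts to observation level
  d$ranef <- ranef(null_lmm)$site[[1]][d$site]

  ## Step 3: use them as starting values for the tree's random effect
  tree_lmm <- lmertree(y ~ treatment | site | x1 + x2,
    data = d, ranefstart = d$ranef)
}
```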

You can additionally try clustering the covariances at the site level by setting cluster = site.

Achim Zeileis
  • Thanks for this helpful answer. I decided to include a global estimate for the non-constant partitioning variable. Also, I really appreciate the suggestion about initializing the model with the random effect, since I have noticed that many splits seem to be caused by differing intercepts rather than differing slopes. Just FYI, when I try to set cluster = site I get an error message: “Error in eval(extras, data, env) : ..1 used in an incorrect context, no ... to look in” – Nick_89 Mar 06 '18 at 17:17
  • Ah, sorry, we still need to properly export that option. At the moment you can only use it with `glmtree` but not `glmertree`. Marjolein and I will try to have a look soon. At least Marjolein made the initialization with the random effect easier. After `install.packages("glmertree", repos = "http://R-Forge.R-project.org")` you can simply do `glmertree(..., ranefstart = TRUE)`. – Achim Zeileis Mar 07 '18 at 00:14
  • Quick follow-up: Marjolein now fixed the handling of the `cluster` argument as well. If you re-install the package from R-Forge (see above), you can also do: `glmertree(..., ranefstart = TRUE, cluster = d$site)` – Achim Zeileis Mar 22 '18 at 23:41
  • 1
    Great, thanks for the update! Any chance you could explain why clustering covariances would be advantageous? As a non-statistician, it’s something I’m having trouble understanding. – Nick_89 Mar 23 '18 at 04:32
  • 1
    When computing the variances/standard errors of the parameter estimates, this accounts for the fact that the clusters are independent but that observations within clusters are potentially correlated. – Achim Zeileis Mar 23 '18 at 06:42
  • Although the lmertree function seems to be running on my actual data with the cluster argument, it's consistently throwing errors saying "Error in chol.default(meat) : the leading minor of order 1 is not positive definite" (only when the cluster argument is used) and "data contains missing observations, note that listwise deletion will be employed" (in all models, even though no values are missing and the cluster argument isn't used). I've also found that when no split occurs, tree$lmer no longer outputs a fixed effect for the covariate (only the intercept), so I can't calculate confint(tree$lmer...). – Nick_89 Mar 23 '18 at 18:45
  • Hmmm, weird, maybe you are using site as a fixed effect? What is your formula currently? – Achim Zeileis Mar 24 '18 at 08:10
  • I don't think that's the problem, site is only included as a random effect. The formula looks like this: lmertree(abundance ~ drought_index | Site.Code | wetland_area, ranefstart = TRUE, data = dataset, cluster = dataset$Site.Code). No worries though, I think I'm going to standardize the variance between sites anyway, so I won't need to use the cluster argument. – Nick_89 Mar 24 '18 at 16:44