8

We're trying to come up with a way for an R function to handle a model which has multiple responses, multiple explanatory variables, and possibly shared parameters between the responses. For example:

Y1 ~ X1 + X2 + X3
Y2 ~ X3 + X4

specifies two responses and four explanatory variables. X3 appears in both, and we want the user to control whether the associated parameter value is the same or different, i.e.:

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4

which is a model with four 'b' parameters, or

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b4 X3 + b5 X4

a model with five parameters.

Two possibilities:

  • Specify all the explanatory variables in one formula and supply a matrix mapping responses to explanatories. In which case

Foo( Y1+Y2 ~ X1 + X2 + X3 + X4, map=cbind(c(1,1,1,0),c(0,0,1,1)))

would correspond to the first case, and

Foo( Y1+Y2 ~ X1 + X2 + X3 + X3 + X4, map=cbind(c(1,1,1,0,0),c(0,0,0,1,1)))

would be the second. Obviously some parsing of the LHS would be needed, or it could be cbind(Y1,Y2). The advantage of this notation is that there is also other information that might be required for each parameter - starting values, priors etc - and the ordering is given by the ordering in the formula.

  • Have multiple formulae and a grouping function that just adds an attribute so shared parameters can be identified - the two examples then become:

Foo( Y1 ~ X1+X2+G(X3,1), Y2 ~ G(X3,1)+X4)

where the X3 parameter is shared between the formulae, and

Foo( Y1 ~ X1+X2+X3, Y2 ~ X3+X4)

which has independent parameters. The second parameter of G() is a grouping ID which gives the power to share model parameters flexibly.

A further explanation of the G function is shown by the following:

Foo( Y1 ~ X1+X2+G(X3,1), Y2~G(X3,1)+G(X4,2), Y3~G(X3,3)+G(X4,2), Y4~G(X3,3))

would be a model where:

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
Y3 = b5 X3 + b4 X4
Y4 = b5 X3

where there are two independent parameters for X3 (G(X3,1) and G(X3,3)). How to handle a group that refers to a different explanatory variable is an open question - suppose that model had Y4~G(X3,2) - that seems to imply a shared parameter between different explanatory variables, since there's a G(X4,2) in there.

This notation seems easier for the user to comprehend, but if you also have to specify starting values then the mapping between a vector of starting values and the parameters they correspond to is no longer obvious. I suspect that internally we'd have to compute the mapping matrix from the G() notation.
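
As a rough illustration of computing that mapping from the G() notation, here is a minimal sketch. The name `param_map` is hypothetical, and `G()` is never evaluated; it is only recognised in the parsed term labels. The sketch also assumes that a group ID is scoped to its variable, so G(X3,2) and G(X4,2) would be different parameters. That is just one possible answer to the open question above, not a settled design.

```r
# Hypothetical helper: number the parameters implied by a set of formulae
# that use G(variable, id) to mark shared terms.
param_map <- function(...) {
  formulas <- list(...)
  keys <- character(0)   # unique parameter keys, in order of first appearance
  rows <- list()
  for (f in formulas) {
    resp <- deparse(f[[2]])                       # response name, e.g. "Y1"
    for (lab in attr(terms(f), "term.labels")) {
      term <- parse(text = lab)[[1]]              # term label back to language
      if (is.call(term) && identical(term[[1]], as.name("G"))) {
        var <- deparse(term[[2]])
        # Assumption: sharing is keyed on variable AND group id
        key <- paste0(var, "#", deparse(term[[3]]))
      } else {
        var <- lab
        key <- paste0(lab, "@", resp)             # private to this response
      }
      if (!key %in% keys) keys <- c(keys, key)
      rows[[length(rows) + 1]] <- data.frame(
        response = resp, variable = var, param = match(key, keys)
      )
    }
  }
  do.call(rbind, rows)
}

m <- param_map(Y1 ~ X1 + X2 + G(X3, 1),
               Y2 ~ G(X3, 1) + G(X4, 2),
               Y3 ~ G(X3, 3) + G(X4, 2),
               Y4 ~ G(X3, 3))
print(m)
```

The `param` column plays the role of the b subscripts in the four-response example, and a vector of starting values could then be indexed by it.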

There may be better ways of doing this, so my question is... does anyone know one?

Spacedman
  • Will you need to account for more complicated model specifications like these: `Y3 = b5 X3; Y4 = 2*b5 X3`, or `Y3 = b5 X3; Y4 = b5 X4`? (Very interesting question, BTW.) – Josh O'Brien Sep 25 '12 at 16:11

2 Answers

1

Interesting question (I wish all package authors worried a lot more in advance about how they were going to create extensions to the basic Wilkinson-Rogers formula notation ...)

How about something like

formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3+X4,Y4~X3),
   shared=list(Y1+Y2~X3,Y2+Y3~X4,Y3+Y4~X3)

or something like that for your second example above?

The formula component gives the list of equations.

The shared component simply lists which response variables share the same parameter for specified predictor variables. It could obviously be mapped into a logical or binary table, but (for me at least -- this is certainly in the eye of the beholder) it's more straightforward. I think the map solution above is awkward when (as in this case) a variable (such as X3) is shared in two distinct sets of relationships.

I guess some straightforward rule like "starting values in the order in which the parameters appear in the list of formulas" -- in this case

X1, X2, X3(1), X4, X3(2)

would be OK, but it might be nice to provide a helper function that would tell the users the names of the coefficient vector (i.e. the order) given a formula/shared specification ...
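
A helper of that kind might look roughly like the sketch below. The name `coef_order` and its arguments are hypothetical, and it assumes the "order of first appearance in the list of formulas" rule just described. It also assumes each response/predictor pair appears in at most one `shared` entry; if a pair appears in several, the last one silently wins.

```r
# Hypothetical helper: report the coefficient order implied by a
# formula/shared specification.
coef_order <- function(formula, shared = list()) {
  # Map each "response:predictor" pair to the index of its shared entry
  group_of <- list()
  for (g in seq_along(shared)) {
    for (r in all.vars(shared[[g]][[2]])) {
      for (v in all.vars(shared[[g]][[3]])) {
        group_of[[paste(r, v, sep = ":")]] <- g
      }
    }
  }
  keys <- character(0)   # one key per distinct parameter
  vars <- character(0)   # predictor attached to each parameter
  for (f in formula) {
    r <- deparse(f[[2]])
    for (v in attr(terms(f), "term.labels")) {
      id  <- paste(r, v, sep = ":")
      g   <- group_of[[id]]                      # NULL if not shared
      key <- if (is.null(g)) id else paste0("shared", g)
      if (!key %in% keys) {
        keys <- c(keys, key)
        vars <- c(vars, v)
      }
    }
  }
  # Number predictors that carry more than one parameter: X3(1), X3(2)
  n   <- ave(seq_along(vars), vars, FUN = seq_along)
  dup <- vars %in% vars[duplicated(vars)]
  ifelse(dup, paste0(vars, "(", n, ")"), vars)
}

nm <- coef_order(
  formula = list(Y1 ~ X1 + X2 + X3, Y2 ~ X3 + X4, Y3 ~ X3 + X4, Y4 ~ X3),
  shared  = list(Y1 + Y2 ~ X3, Y2 + Y3 ~ X4, Y3 + Y4 ~ X3)
)
print(nm)
```

For the specification above this reproduces the ordering X1, X2, X3(1), X4, X3(2), so users could name their starting values or priors against it.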

From a bit of personal experience, I would say that embedding more fanciness in the formula itself leads to pain ... for example, the original nlme syntax with the random effects specified separately was easier to deal with than the new lme4-style syntax with random effects and fixed effects mixed in the same formula ...

An alternative (which I don't like nearly as well) would be

 formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3[2]+X4,Y4~X3[2])

where new parameters are indicated by some sort of tag (with [1] being implicit).

Also note the suggestion from the chat room by @Andrie that interfaces designed for structural equation modeling (sem, lavaan packages) may be useful references.

Ben Bolker
  • +1, and +5 if I could because this is so definitely the way I would go. Storing the constraints in a separate argument has a nice orthogonality, makes the formulae in both arguments easier to read, and may even make the programming a bit more straightforward. I have the same preference for `nlme` vs `lme4`-style syntax, and even found lattice's conditioning variable (i.e. `|`) harder to pick up than its named `groups=` argument. If you can at all help it, you really don't want users to have to learn any new syntax/mini-language to use your package. – Josh O'Brien Sep 25 '12 at 16:04
  • Yeah, we like this answer. It might have to wait for v2 of the package though, since doing all this shared parameter business might take a while - v1 will just have independent parameters and one big fat prior (I think...). It's not me writing it... – Spacedman Sep 27 '12 at 15:31
0

Of the two methods you propose, the second one with the idea of several formulae looks more natural, but the G notation makes no sense to me.

The first one is much easier to understand, but I have two suggested tweaks to the map argument.

  1. It should really take logical values rather than numbers.

  2. Consider having a default of including all the independent variables for each response variable.
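
A minimal sketch of both tweaks, assuming a hypothetical helper `default_map` and a `cbind()` left-hand side for the responses: the map is a logical matrix with one row per predictor and one column per response, and it defaults to all TRUE.

```r
# Hypothetical helper: validate a logical map, or build the default one
# in which every predictor is used for every response.
default_map <- function(formula, map = NULL) {
  responses  <- all.vars(formula[[2]])
  predictors <- attr(terms(formula), "term.labels")
  if (is.null(map)) {
    map <- matrix(TRUE, nrow = length(predictors), ncol = length(responses))
  }
  stopifnot(is.logical(map),
            identical(dim(map), c(length(predictors), length(responses))))
  dimnames(map) <- list(predictors, responses)
  map
}

# Default: all predictors for both responses
m_all <- default_map(cbind(Y1, Y2) ~ X1 + X2 + X3 + X4)

# Explicit logical map: Y1 uses X1, X2, X3 and Y2 uses X3, X4
m_sub <- default_map(cbind(Y1, Y2) ~ X1 + X2 + X3 + X4,
                     map = cbind(c(TRUE, TRUE, TRUE, FALSE),
                                 c(FALSE, FALSE, TRUE, TRUE)))
print(m_sub)
```

Naming the rows and columns makes it easy to read off which predictors feed which response, for example with `rownames(m_sub)[m_sub[, "Y2"]]`.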

Richie Cotton