Using R to interpret a symbolic formula for outside use

Question

In R, the formula object is symbolic and it seems rather hard to parse. However, I need to parse such a formula into an explicit set of labels for use outside of R.

(1)

Let f be a model formula in which no response is specified, e.g. ~V1 + V2 + V3. One thing I tried was:

t <- terms(f)
attr(t, "term.labels")

However, this does not give fully explicit labels if some of the variables in f are categorical. For example, let V1 be a categorical variable with 2 categories (i.e. a boolean) and let V2 be a double.

Then a model specified by ~V1:V2 should have 2 parameters: "intercept" and "V1yes:V2". Meanwhile, a model specified by ~V1:V2 - 1 should have parameters "V1no:V2" and "V1yes:V2". However, without a way of telling terms() which variables are categorical (and how many categories each has), it has no way of interpreting these. Instead, it just has V1:V2 in its "term.labels", which is not explicit enough when V1is categorical.
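A minimal sketch of the limitation described above, using only base R. The variable names `f_int`, `f_noint`, `labels_int` and `labels_noint` are made up for illustration. Because terms() sees only symbols, it returns the same label whether or not the intercept is dropped:

```r
# terms() works on the formula alone, so it cannot know that V1 is a factor.
f_int   <- ~ V1:V2        # with intercept
f_noint <- ~ V1:V2 - 1    # intercept removed

labels_int   <- attr(terms(f_int), "term.labels")
labels_noint <- attr(terms(f_noint), "term.labels")

# Both are just the symbolic term; only the "intercept" attribute differs.
print(labels_int)
print(labels_noint)
print(attr(terms(f_int), "intercept"))
print(attr(terms(f_noint), "intercept"))
```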

(2)

On the other hand, using model.matrix is an easy way to get exactly what I want. The problem is that it requires a data argument, which is bad for me because I only want an explicit interpretation of the symbolic formula for use outside of R. Getting it this way will waste a lot of time (comparatively), because R has to read the data from an outside source when all it really needs to know for the formula is which variables are categorical (and how many categories each has) and which variables are doubles.

Is there any way to use 'model.matrix' with only specifying the types of data, rather than the actual data? If not, what else is a viable solution?

score 4 · Accepted Answer · answered May 16 '13 at 16:51

Well, it is only in the context of having data that it can be determined whether a given variable is a factor or numeric. So you can't do it without the data argument. But all you need is the structure, not the data itself, so you can use a 0-row data frame with the columns of all the right types.

f <- ~ V1:V2
V1numeric <- data.frame(V1=numeric(0), V2=numeric(0))
V1factor <- data.frame(V1=factor(c(), levels=c("no","yes")), V2=numeric(0))

Looking at the two data.frames:

> V1numeric
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1numeric)
'data.frame':   0 obs. of  2 variables:
 $ V1: num 
 $ V2: num 
> V1factor
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1factor)
'data.frame':   0 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "no","yes": 
 $ V2: num

Use model.matrix with these:

> model.matrix(f, data=V1numeric)
     (Intercept) V1:V2
attr(,"assign")
[1] 0 1
> model.matrix(f, data=V1factor)
     (Intercept) V1no:V2 V1yes:V2
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$V1
[1] "contr.treatment"
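Since the goal is a set of explicit labels for use outside R, the column names of the 0-row model matrix can be pulled out directly. The following is a sketch under the same assumptions as above (V1 a two-level factor, V2 numeric); `param_labels` and `term_index` are illustrative names, not part of any API:

```r
f <- ~ V1:V2

# 0-row data frame that carries only the column types and factor levels
V1factor <- data.frame(
  V1 = factor(character(0), levels = c("no", "yes")),
  V2 = numeric(0)
)

mm <- model.matrix(f, data = V1factor)

param_labels <- colnames(mm)        # explicit parameter labels
term_index   <- attr(mm, "assign")  # which term each column belongs to (0 = intercept)

# One label per line, e.g. for piping to another program
writeLines(param_labels)
```

The `assign` attribute may be useful outside R as well, since it maps each explicit column back to the symbolic term that produced it.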

If you have a real data set, it is easy to get a 0-row data.frame from that which retains the column information. Just subscript it with FALSE. You don't need to build the data.frame by hand if you have one with the right properties.

> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> str(mtcars[FALSE,])
'data.frame':   0 obs. of  11 variables:
 $ mpg : num 
 $ cyl : num 
 $ disp: num 
 $ hp  : num 
 $ drat: num 
 $ wt  : num 
 $ qsec: num 
 $ vs  : num 
 $ am  : num 
 $ gear: num 
 $ carb: num
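Note that every column of mtcars is numeric, so the 0-row version treats all of them as doubles. If a column should be interpreted as categorical, it can be converted in the skeleton. This is a sketch only; treating `am` as a factor with levels 0 and 1 is an assumption made for the example, and `skeleton` and `am_labels` are illustrative names:

```r
# Keep the column structure, drop all rows
skeleton <- mtcars[FALSE, ]

# Declare `am` categorical; levels must be given explicitly because
# a 0-row column contains no values to infer them from.
skeleton$am <- factor(skeleton$am, levels = c(0, 1))

am_labels <- colnames(model.matrix(~ am:wt, data = skeleton))
print(am_labels)
```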

Thanks for the quick reply. I do have a real data set, but it is large and outside R, so reading it is what reduces efficiency. What you have in the first part looks great, but one thing I don't understand is that `model.matrix(f, data=V1factor)` produces 3 relevant parameters, when it shouldn't have `V1no:V2` as there is an intercept. — Jon Claus, May 16 '13 at 17:02
R is known to be reluctant to remove lower order parameters when interactions are present, perhaps that is the reason: http://stackoverflow.com/q/11335923/289572 — Henrik, May 16 '13 at 17:05
@JonClaus I think it should have three parameters: the intercept, the slope of `V2` when `V1` is no, and the slope of `V2` when `V1` is yes (the way it is parametrized here). You could also get 3 parameters with an intercept, a slope of `V2` when `V1` is no, and the change in the slope of `V2` when `V1` changes from no to yes. However you parametrize it, there are 3 parameters. — Brian Diggs, May 16 '13 at 17:23
Also, you don't need to read the whole data set that is outside of R; just enough of it to get the structure. Or if you know the structure, you can just create that "by hand" and not have to read anything in. Whichever approach is easiest; both work. — Brian Diggs, May 16 '13 at 17:25
Alright, that shouldn't be too much of a problem to deal with outside of R once I have the explicit labels for parameters. Thanks. — Jon Claus, May 16 '13 at 17:25
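The "by hand" route mentioned in the comments could look roughly like the sketch below. `make_skeleton` and its `spec` format are hypothetical, invented for this example: each entry is either NULL (numeric variable) or a character vector of factor levels. The last lines also check the point made in the comments that both parametrizations yield three parameters.

```r
# Hypothetical helper: build a 0-row data frame from a type specification.
# spec: named list; NULL means numeric, a character vector means factor levels.
make_skeleton <- function(spec) {
  cols <- lapply(spec, function(lv) {
    if (is.null(lv)) numeric(0) else factor(character(0), levels = lv)
  })
  as.data.frame(cols)
}

spec <- list(V1 = c("no", "yes"), V2 = NULL)
skel <- make_skeleton(spec)

# Separate slope of V2 for each level of V1
labels_a <- colnames(model.matrix(~ V1:V2, data = skel))

# Baseline slope of V2 plus the change in slope between levels of V1
labels_b <- colnames(model.matrix(~ V2 + V1:V2, data = skel))

print(labels_a)
print(labels_b)
```

Nothing is read from the external data set here, so the cost is independent of its size; the only requirement is that the factor levels in `spec` match those in the real data.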
