Consider these various R formulas with interactions:
x ~ a + b + c + d + c:d + a:b
x ~ c + a + b + d + a:b + d:c
x ~ a + b + c + d + c * d + a * b
x ~ a * b + c * d
x ~ b * a + c * d
For the purposes of something like a linear model, these are all equivalent. Say I have a large set of formulas and want to check whether it contains any duplicates, including non-obvious duplicates like the ones above. Is there a simple way to do this kind of comparison?
There are three challenges:
- Have to remove redundancies (d + c * d is equivalent to c * d)
- Have to be able to match elements in different orders (a + b same as b + a)
- Have to be able to match commuted interactions (c:d is the same as d:c)
Just a terms() call with some sorting doesn't seem to get at it, mostly because of the last one.
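To illustrate, here is a minimal base-R check of what terms() plus sorting does and does not catch. The variable names (f_redundant, f_compact, f_swapped, and so on) are just for this example.

```r
# Base R only: compare the term labels that terms() produces.
f_redundant <- x ~ a + b + c + d + c * d + a * b
f_compact   <- x ~ a * b + c * d
f_swapped   <- x ~ b * a + c * d

labels_redundant <- attr(terms(f_redundant), "term.labels")
labels_compact   <- attr(terms(f_compact), "term.labels")
labels_swapped   <- attr(terms(f_swapped), "term.labels")

# Redundant terms and term order are handled once the labels are sorted.
same_after_sort <- identical(sort(labels_redundant), sort(labels_compact))

# The interaction written as b * a is labelled "b:a" rather than "a:b",
# so sorting the labels alone does not make these two formulas match.
swapped_matches <- identical(sort(labels_compact), sort(labels_swapped))

print(same_after_sort)
print(swapped_matches)
print(labels_swapped)
```

So the first two challenges are covered by terms() and sort(), and only the interaction labels need extra work.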
Here's how I worked it out so far (written as a functional sequence for ease of reading):
library(tidyverse) # provides %>%, str_split() and map_chr()
get.terms <- {
  . %>%
    terms %>%                      # use terms to get the parts
    attr("term.labels") %>%        # character vector of elements
    str_split(":") %>%             # separate interaction terms (makes list)
    map_chr(                       # go through each list item
      ~ .x %>%
        sort %>%                   # if multiples (interaction), sort
        paste0(collapse = ":")     # combine back
    ) %>%                          # output (now standardized) term list
    sort                           # sort the term list for comparison
}
# Which gives:
get.terms(x ~ a + b + c + d + c:d + a:b)
get.terms(x ~ c + a + b + d + a:b + d:c)
get.terms(x ~ a + b + c + d + c * d + a * b)
get.terms(x ~ a * b + c * d)
get.terms(x ~ b * a + c * d)
# so you can test:
all.equal(get.terms(x ~ b * a + c * d), get.terms(x ~ c + a + b + d + a:b + d:c))
# the left-hand side is ignored, so more would be needed to tell these two apart:
all.equal(get.terms(foo ~ b * a + c * d), get.terms(bar ~ c + a + b + d + a:b + d:c))
But this seems hacky for such a fundamental part of R.
I realize that you could probably shorten this a bit with a list element comparison nearer the end, but the extra steps are intentional, as the idea is to be able to construct a standardized, human-readable formula notation too. It's more that the whole process, especially flipping the interaction terms, seems like it shouldn't be necessary.
Anyone know an easier or more canonical way to do this?
Bonus points if it can incorporate potential left-hand-side differences as well.
Double bonus if it can output a standardized formula format (or string equivalent).
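For reference, here is a rough base-R sketch of the kind of output I have in mind. canonical_formula() and its include_lhs argument are hypothetical names, not an existing R function. It assumes plain variable names joined by ":" (a term such as I(a:b) would be split incorrectly), and it does not handle offsets or formulas containing ".".

```r
# Hypothetical helper: returns a standardized string for a formula.
canonical_formula <- function(f, include_lhs = TRUE) {
  tt <- terms(f)
  labels <- attr(tt, "term.labels")

  # Sort the variables inside each interaction, then rebuild the label.
  # method = "radix" sorts in the C locale, so the result is locale-independent.
  parts <- strsplit(labels, ":", fixed = TRUE)
  labels <- vapply(
    parts,
    function(p) paste(sort(p, method = "radix"), collapse = ":"),
    character(1)
  )

  # Order terms by interaction degree first, then alphabetically.
  labels <- labels[order(lengths(parts), labels, method = "radix")]

  # Represent a dropped intercept as "0" and an empty right side as "1".
  if (attr(tt, "intercept") == 0) {
    labels <- c("0", labels)
  } else if (length(labels) == 0) {
    labels <- "1"
  }
  rhs <- paste(labels, collapse = " + ")

  # Keep the left-hand side only when there is one and it is wanted.
  has_lhs <- attr(tt, "response") > 0
  lhs <- if (include_lhs && has_lhs) paste(deparse(f[[2]]), collapse = "") else ""
  trimws(paste(lhs, "~", rhs))
}

formulas <- list(
  x ~ a + b + c + d + c:d + a:b,
  x ~ c + a + b + d + a:b + d:c,
  x ~ a + b + c + d + c * d + a * b,
  x ~ a * b + c * d,
  x ~ b * a + c * d,
  y ~ a * b + c * d
)
canon <- vapply(formulas, canonical_formula, character(1))
print(canon)
print(duplicated(canon))
```

With include_lhs = FALSE the same helper compares right-hand sides only, so foo ~ b * a and bar ~ a * b would count as duplicates. It is still string manipulation on the term labels, though, which is exactly the part that feels hacky.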