
I have used lm() for my multiple regression analysis, and then used gvlma for assumption testing, where the results showed that the Global Stat and Heteroskedasticity tests were not satisfied.

The form of the code is as follows (all variables are continuous):

model_1 <- lm (y ~ x1 + x2, data = abc)

Then I ran one more model with the same variables (thinking that I had to introduce an interaction term to fix the gvlma assumptions):

model_2 <- lm (y ~ x1 + x2, x1 * x2, data = abc)

With this model_2, all the assumptions are satisfied. But when I checked, I realised the interaction term was not introduced correctly. I can't see what that comma does here between the variables.

I am in a difficult situation: the model fits well, but I cannot explain what , x1 * x2 does in the equation / results.

Please help me to understand.


1 Answer


In linear model formulas the interaction term is defined by : and terms are separated by +, so a model with both main effects and the interaction is

lm(y ~ x1:x2 + x1 + x2)

However, you can write x1*x2, which expands to the interaction plus both main effects, so the following is equivalent to the above

lm(y ~ x1*x2)
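A quick way to confirm that equivalence (here using iris columns as stand-ins for x1, x2, and y, since we don't have the asker's data):

```r
# Both formulas expand to the same three terms, so the fitted models are identical
m_colon <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Petal.Width:Sepal.Length,
              data = iris)
m_star  <- lm(Petal.Length ~ Petal.Width * Sepal.Length, data = iris)
all.equal(fitted(m_colon), fitted(m_star))  # TRUE - same model, different notation
```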

See what happens with the built-in dataset iris: when the fixed effects are specified as Petal.Width*Sepal.Length, all three terms appear in the model summary:

Call:
lm(formula = Petal.Length ~ Petal.Width * Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99588 -0.24329  0.00355  0.29735  1.24780 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -3.24804    0.59586  -5.451 2.08e-07 ***
Petal.Width               2.97115    0.35836   8.291 6.74e-14 ***
Sepal.Length              0.87551    0.11667   7.504 5.60e-12 ***
Petal.Width:Sepal.Length -0.22248    0.06384  -3.485  0.00065 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3888 on 146 degrees of freedom
Multiple R-squared:  0.9525,    Adjusted R-squared:  0.9515 
F-statistic: 975.4 on 3 and 146 DF,  p-value: < 2.2e-16

As to what the comma is doing in your models: it is creating a subset. Compare the summaries of the following three models. The first two have 146 and 147 residual degrees of freedom - they use all 150 data points and estimate 4 and 3 parameters respectively. The third model, which mimics your specification, has only 129 degrees of freedom - that's what made me realise it was subsetting. Checking the documentation for lm(), there is an argument for subsetting: lm(formula, data, subset, ...). Because data is named explicitly, the remaining positional arguments are matched to formula and subset, so x1 * x2 is silently used as the subset. You can also see this in the model summary, which shows the subset in the model call.

summary(lm(Petal.Length ~ Petal.Width * Sepal.Length, data = iris))
summary(lm(Petal.Length ~ Petal.Width + Sepal.Length, data = iris))
summary(lm(Petal.Length ~ Petal.Width + Sepal.Length, Petal.Width * Sepal.Length, data = iris))
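The subsetting is easiest to see by comparing residual degrees of freedom directly (a compact check of the numbers quoted above):

```r
m1 <- lm(Petal.Length ~ Petal.Width * Sepal.Length, data = iris)
m2 <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = iris)
# Here the second positional argument is silently matched to subset
m3 <- lm(Petal.Length ~ Petal.Width + Sepal.Length,
         Petal.Width * Sepal.Length, data = iris)
c(df.residual(m1), df.residual(m2), df.residual(m3))  # 146 147 129
```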

Your result can be recreated by passing the vector iris$Petal.Width * iris$Sepal.Length as row numbers - so be careful: that reuses some rows many times and skips others entirely, so the result of this model does not match one that uses all the data (and each data point only once).

summary(lm(Petal.Length ~ Petal.Width + Sepal.Length, data = iris[iris$Petal.Width * iris$Sepal.Length, ]))
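To see exactly which rows that accidental subset picks, inspect the index vector itself (a sketch: trunc() mirrors how R truncates non-integer row indices, and an index of 0 selects no row at all):

```r
idx <- iris$Petal.Width * iris$Sepal.Length
range(idx)            # non-integer values being used as row numbers
head(table(trunc(idx)))  # many products map to the same row, so rows repeat
sum(trunc(idx) >= 1)  # how many rows actually enter the model (with duplicates)
```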
  • Thanks for the response. When I use lm (y ~ x1*x2), the gvlma results shows Global Stat assumptions not acceptable...Also VIF is very high....Hence I have used lm (y ~ x1 + x2, x1 * x2), then everything is passing...but I am not sure if this lm method is correct, and also dont know how to explain 'comma' – Vinod Kumar J S May 10 '20 at 23:43
  • My specific query was what , x1 * x2 is doing in the lm equation? – Vinod Kumar J S May 11 '20 at 12:58