2

I am new to Stack Overflow and I am also new to R and statistics. I need to create a linear regression model to describe the weight of a car based on some variables in a given dataset.

wtlm <- lm(weight ~ foreign + cylinders + displacement + hp + acceleration, data = HW2_CarData)

summary(wtlm)

I'm not sure exactly how to conduct statistical tests with this model because I'm not sure if this "wtlm" describes the proper LR equation of weight = B1X1 + B2X2 + ... + Error.

Can someone help me fill in the gap between this and doing the statistical test? I need to do a test to determine whether domestic cars are heavier than foreign cars (probably by using the binary variable 'foreign'). If it were outside of R, I would try to divide the cars into two groups: 1 group of only American cars and 1 group of only foreign cars, and then try to do a statistical test for comparing two samples from two different populations.

I have read many help pages on using 'lm' in R but it doesn't quite help me with this question.

Also, I'm curious about the difference between lm(weight~foreign + cylinders + ...) vs lm(formula= ...)

If anyone can explain that, that would be really helpful too!

jkdev
Kim
  • I'd say it is more of a statistics related question. You'd find more help on Cross Validated forum. Stack Overflow in essence works like this: you have clear input and you have a clear idea of what should be the output and the missing link is the code itself. In your case missing link is statistical analysis / interpretation of linear model output and methods of conducting research... – statespace Mar 12 '15 at 08:44
  • Indeed, a simple t-test would be the recommended way to compare the two groups, under the assumption that both groups are representative of the whole car population. However, by including all the other variables you can "control" for them and see whether the influence of "foreign" remains significant. And yes, the summary() function will give you the proper test results. Whichever you choose, you still have to check the validity of the model assumptions afterwards. – agenis Mar 12 '15 at 08:44
  • Maybe you'll find `anova(wtlm)` easier to interpret. – Roland Mar 12 '15 at 08:47
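
The `anova()` suggestion in the comments can be tried on a built-in dataset. This is only a sketch: `mtcars` stands in for `HW2_CarData` (which is not available here), and `am` is assumed to play the role of the binary `foreign` variable. Note that `anova()` on a single `lm` fit reports sequential tests, so the order of the terms matters; `drop1()` is shown next to it as one way to get a test for each term given all the others.

```r
# Assumption: mtcars stands in for HW2_CarData; `am` plays the role of `foreign`.
fit <- lm(wt ~ am + cyl + disp + hp + qsec, data = mtcars)

# Sequential (Type I) F-tests: each term is tested after the terms listed before it.
seq_tab <- anova(fit)
print(seq_tab)

# Marginal F-tests: each term is tested given all the other terms in the model.
marg_tab <- drop1(fit, test = "F")
print(marg_tab)
```

For a term with a single coefficient, the marginal F-test p-value from `drop1()` is the same as the t-test p-value in `summary(fit)`.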

3 Answers

1

Using summary(wtlm), you get the estimated coefficient (B) for the effect of a car being foreign on its weight. The t value and its associated p-value together make up the hypothesis test for that coefficient. So if p < .05 (the conventional threshold), then foreign, being a binary variable, has a statistically significant "effect" on weight. To gauge the size of the effect, use confint(wtlm), which gives the 95% confidence interval for each coefficient. (The units are those of your dependent variable: if weight is in kilograms, foreign cars differ from non-foreign cars by B kilograms on average, holding all other predictors constant.)
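
As a rough illustration of reading one coefficient's test and interval, here is a minimal sketch. It assumes `mtcars` as a stand-in for `HW2_CarData`, with `am` playing the role of `foreign`; the other predictors are picked only to mirror the shape of the original model.

```r
# Assumption: mtcars stands in for HW2_CarData; `am` plays the role of `foreign`.
fit <- lm(wt ~ am + cyl + disp + hp + qsec, data = mtcars)

# Coefficient table as a matrix with columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
print(coefs["am", ])

# 95% confidence intervals for every coefficient; pick out the row for `am`.
ci <- confint(fit, level = 0.95)
print(ci["am", ])
```

The interval for `am` is in the units of `wt` and describes the difference between the two groups with the other predictors held constant.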

And yes, this correctly represents the LR model with an error term (R also adds an intercept by default). As for formula=, it is not mandatory; adding it doesn't change anything. Naming the argument only matters when you supply arguments in a different order than the function expects. Read about argument matching in R functions to learn more.

Dominic Comtois
  • What do you mean by "test value"? Is this the same as the p value? The summary function yields a table with column "t value" and another column "Pr(>|t|)" and I am not sure how to interpret either of these columns. – Kim Mar 17 '15 at 07:38
  • Sorry, that wasn't clear. When you say you want to "test", we generally think of a correlation test, a t-test, or something like that. All of those basically include two values: the statistic itself (an F or t statistic, for instance), plus a p-value: rigorously speaking, the probability of observing the results we have in our sample (or results further away from 0), given that H0 is true. When doing a regression, we run several of those tests simultaneously, one per coefficient. By that, we mean checking whether the t statistic allows one to say that a coefficient is statistically significantly different from 0. – Dominic Comtois Mar 17 '15 at 08:50
  • I rephrased the beginning of my answer, hoping it's clearer now. On a more technical note, the `t value` you see in the `summary` output is merely the `Estimate` (often called Beta) divided by its `Std. Error`. – Dominic Comtois Mar 17 '15 at 08:56
  • Thank you, Dominic. This strays from the original question but what can we say about the value of "t" besides what is described by the p-value? For example, are we comparing the magnitude of "t" to some critical value in the t distribution for a certain confidence interval? Do we need to pay attention to the sign of "t"? I'm not sure what the interpretation is. – Kim Mar 18 '15 at 21:48
  • If you were to do things "by hand", the old-fashioned way, yes, you'd look up a critical value in a table of the _t_ distribution (taking into account your desired _alpha level_ and _degrees of freedom_) and check whether your _t_ is larger than this value in absolute terms. Now, you don't need to do that anymore. R doesn't tell you the critical value, but when your _p-value_ is < .05, you know that your _t_ is beyond the critical value, whatever it might be. – Dominic Comtois Mar 19 '15 at 00:17
  • On the other hand, we pay less and less attention to _p-values_ because they tend to get smaller as samples get larger. It's best to look at confidence intervals, which give you a range that would contain the true effect of your variable in 95% of repetitions of an experiment like yours. – Dominic Comtois Mar 19 '15 at 00:18
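
The relationship between the estimate, the t value, the critical value and the p-value described in the comments above can be checked by hand. The sketch below uses `mtcars` with `vs` as the binary predictor purely for illustration; the 5% alpha level is an assumption.

```r
fit <- lm(wt ~ vs, data = mtcars)
coefs <- coef(summary(fit))

est <- coefs["vs", "Estimate"]
se  <- coefs["vs", "Std. Error"]

# The t value is the estimate divided by its standard error.
# Its sign is simply the sign of the estimate (the direction of the difference).
t_manual <- est / se

# Two-sided critical value at alpha = 0.05 (assumed), using the residual df.
df_resid <- df.residual(fit)
t_crit <- qt(1 - 0.05 / 2, df = df_resid)

# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual <- 2 * pt(abs(t_manual), df = df_resid, lower.tail = FALSE)

print(c(t = t_manual, critical = t_crit, p = p_manual))
```

Comparing `abs(t_manual)` with `t_crit` and comparing `p_manual` with 0.05 always lead to the same decision, which is why the p-value alone is enough.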
0

You don't really need linear regression for the example you mention:

"I need to do a test to determine whether domestic cars are heavier than foreign cars (probably by using the binary variable 'foreign')."

Let me give you an example. Here I am testing whether the variable "wt" has different means across the groups defined by "am" (which is binary).

data(mtcars)
t.test(wt~am,data=mtcars)
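
Since the question asks specifically whether one group is heavier, a one-sided version may be closer to what is wanted. The sketch below is an illustration on `mtcars`, not the asker's data. With the formula interface the difference is taken as first group minus second group (here `am == 0` minus `am == 1`), so `alternative = "greater"` tests whether the `am == 0` cars are heavier. By default `t.test` runs Welch's test, which does not assume equal variances.

```r
data(mtcars)

# Two-sided Welch t-test: are the group means different?
two_sided <- t.test(wt ~ am, data = mtcars)

# One-sided test: is the mean of the first group (am == 0) greater
# than the mean of the second group (am == 1)?
one_sided <- t.test(wt ~ am, data = mtcars, alternative = "greater")

print(two_sided$estimate)   # the two group means
print(two_sided$p.value)
print(one_sided$p.value)
```

Keep in mind that this compares raw group means only; it does not control for any other variable.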
  • This is misleading. They should look at the p-value behind `foreign` in their `summary` output. (And of course also consider possible interactions...) – Roland Mar 12 '15 at 08:45
0

I respectfully disagree with all of the t-test-like answers above. The OP mentions he is interested in the difference in weight between domestic and foreign cars and wants to determine weight:

"...based on some variables in a given dataset"

The question is thus about weight differences between domestic and foreign cars, controlling for other car characteristics. A t-test does not allow for that, while regression (or anova) does.

Let's use the mtcars dataset and pretend that cars with V-shaped engines (vs == 0) are US cars and cars with straight engines (vs == 1) are European ('foreign') cars.

# vs: 0 = V-shaped engine (treated as US), 1 = straight engine (treated as European)
m1 <- lm(formula = wt ~ vs, data = mtcars)
summary(m1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.6886     0.1950  18.913  < 2e-16 ***
vs           -1.0773     0.2949  -3.654  0.00098 ***

The abridged output shows that, when not controlling for other characteristics, European cars weigh less on average (3.6886 + 1 × -1.0773 = 2.6113) than US cars (3.6886 + 0 × -1.0773 = 3.6886); mtcars records wt in units of 1000 lbs.

However, this difference may well be attributable to differences in how European and US cars are made. E.g. US cars may be more likely to be automatic rather than manual, and may on average have more gears and carburettors than European cars, all of which contribute to the weight of a car. Let's add these factors to the model and see whether the US/European difference in weight still exists.

m2 <- lm(formula = wt ~ am + as.factor(carb) + as.factor(gear) + vs, data = mtcars)
summary(m2)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.5658     0.4283   8.325 3.03e-08 ***
am                -0.8585     0.4378  -1.961   0.0627 .  
as.factor(carb)2   0.1250     0.3871   0.323   0.7499    
as.factor(carb)3   0.2942     0.5257   0.560   0.5813    
as.factor(carb)4   0.9034     0.4714   1.916   0.0684 .  
as.factor(carb)6   0.7693     0.7966   0.966   0.3446    
as.factor(carb)8   1.5693     0.7966   1.970   0.0615 .  
as.factor(gear)4  -0.4427     0.5015  -0.883   0.3869    
as.factor(gear)5  -0.7066     0.6228  -1.135   0.2688    
vs                -0.3322     0.4237  -0.784   0.4413

The last line of the abridged output now shows that, once these car characteristics are taken into account, the US/European difference in weight is no longer statistically significant (estimate -0.3322, p = 0.44). It also illustrates nicely how this answer differs substantively from the recommended t-test (or the single-variable regression in model m1).
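
To put the two results next to each other, the `vs` row can be pulled out of both fits. This is a small sketch that refits the two models from this answer so it runs on its own; the F-test at the end is an extra illustration of comparing nested models and is not part of the original output.

```r
m1 <- lm(wt ~ vs, data = mtcars)
m2 <- lm(wt ~ am + as.factor(carb) + as.factor(gear) + vs, data = mtcars)

# The `vs` coefficient before and after adjusting for the other characteristics.
vs_rows <- rbind(
  unadjusted = coef(summary(m1))["vs", ],
  adjusted   = coef(summary(m2))["vs", ]
)
print(vs_rows)

# 95% confidence interval for the adjusted `vs` effect.
ci_vs <- confint(m2)["vs", ]
print(ci_vs)

# m1 is nested in m2, so an F-test can check whether the added terms improve the fit.
print(anova(m1, m2))
```

An adjusted interval that includes 0 tells the same story as the p-value of 0.44: the data are compatible with no difference once the other characteristics are held constant.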

"Also, I'm curious about the difference between lm(weight~foreign + cylinders + ...) vs lm(formula= ...)"

There is no substantive difference. The former is shorthand for the latter. However, when using the shorthand notation, the arguments (formula, data, etc.) must be supplied in the order the function expects (see ?lm).
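
A quick way to convince yourself is to fit the same model with the arguments named, unnamed, and named in a different order, then compare the coefficients. This is a minimal sketch on `mtcars`; the object names are made up for the example.

```r
# formula given positionally, data named
fit_short <- lm(wt ~ vs, data = mtcars)

# both arguments named
fit_named <- lm(formula = wt ~ vs, data = mtcars)

# both named, in a different order: names make the order irrelevant
fit_swapped <- lm(data = mtcars, formula = wt ~ vs)

# both positional: works because formula comes first and data second in lm()
fit_positional <- lm(wt ~ vs, mtcars)

print(all.equal(coef(fit_short), coef(fit_named)))
print(all.equal(coef(fit_short), coef(fit_swapped)))
print(all.equal(coef(fit_short), coef(fit_positional)))
```

All four calls fit the same model; only the way the arguments are matched differs.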

Richard