To the statistics experts out there: I am getting a headache thinking about how to interpret a regression.

If you want to test for anomalies, you can do this by using dummy variables D in a regression. Let's say you want to find out whether a specific day behaves abnormally, because we have the feeling that we make more money on Fridays. The regression looks like this:

Return/Earnings = a + b1 DMonday + b2 DTuesday + b3 DWednesday + b4 DThursday + b5 DFriday + e

Of course, your earnings also depend on other things, such as the number of customers, the price level, the weather, ... who knows.

Let's say b5 has got a p-value close to zero. But R2 is zero as well. How can I interpret this result?

Saying the whole model can't forecast earnings because R2 is zero!? Makes sense to me. On the other hand, I can say Friday is significantly better than other days. That makes sense to me as well.

But I don't understand why, if Friday is significant, the whole model has an R2 close to zero. I know that some people use ANOVA and Kruskal-Wallis instead, but I also know that regressions are used often. I just don't get the idea behind it. Any interpretation would be very much appreciated.

PS: add on - why do some people drop Monday from the regression? Okay, Monday works as a reference in this case. I can understand this. But what's the advantage of doing this? Isn't the result the same?

Poldi

1 Answer

Summary

  • Having a low p-value implies statistical significance, which in this case means there is evidence of a linear relationship between the predictor variable and the target variable (a coefficient different from zero)
  • The R2 score measures the share of the target variable's variability that the model explains, i.e. how closely the target can be predicted from the values of the predictors
  • It is possible to have a low p-value and a low R2 value at the same time, as they measure different things in linear regression

PS: add on - why do some people drop Monday from the regression? Okay, Monday works as a reference in this case. I can understand this. But what's the advantage of doing this? Isn't the result the same?

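Monday is used as a reference and is redundant: with an intercept in the model, the five day dummies always add up to 1, so one of them has to be dropped for the coefficients to be uniquely determined. The fitted values stay the same; each remaining coefficient then measures the difference from Monday. It also reduces the complexity of the analysis by removing one variable. There's a nicer post about dummy variables you can read here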
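As a rough illustration of the point about the reference day, here is a minimal sketch in plain Python. The earnings numbers and the helper names (`dummies`, `predict`) are made up for this example and are not from the question. It shows that with an intercept plus all five dummies, different coefficient sets produce identical fitted values, and that with Monday as the reference the intercept is simply the Monday average.

```python
import math

# Hypothetical earnings per weekday, invented purely for illustration.
data = {
    "Monday": [100.0, 104.0],
    "Tuesday": [98.0, 100.0],
    "Wednesday": [101.0, 99.0],
    "Thursday": [103.0, 101.0],
    "Friday": [110.0, 114.0],
}
days = list(data)
means = {d: sum(v) / len(v) for d, v in data.items()}


def dummies(day):
    """One 0/1 indicator per weekday, in the order of `days`."""
    return [1.0 if day == d else 0.0 for d in days]


def predict(intercept, coefs, day):
    """Fitted value a + sum(b_i * D_i) for the given day."""
    return intercept + sum(c * x for c, x in zip(coefs, dummies(day)))


# The five dummies always sum to 1, i.e. they duplicate the intercept column.
assert all(sum(dummies(d)) == 1.0 for d in days)

# Consequence: with all five dummies, the coefficients are not unique.
# Shifting the intercept by 50 and every dummy coefficient by -50 fits equally well.
full_a = (0.0, [means[d] for d in days])
full_b = (50.0, [means[d] - 50.0 for d in days])
same_fit = all(
    math.isclose(predict(*full_a, d), predict(*full_b, d)) for d in days
)

# Dropping Monday (reference coding) pins the coefficients down:
# intercept = Monday mean, each coefficient = that day's mean minus Monday's mean.
intercept = means["Monday"]
ref_coefs = {d: means[d] - means["Monday"] for d in days if d != "Monday"}

print("same fitted values with different coefficients:", same_fit)
print("intercept (Monday mean):", intercept)
print("Friday coefficient (Friday mean - Monday mean):", ref_coefs["Friday"])
print("intercept + Friday coefficient:", intercept + ref_coefs["Friday"])
```

With these made-up numbers the intercept is 102.0 and the Friday coefficient is 10.0, so intercept plus coefficient recovers the Friday average of 112.0. The fitted values are the same either way; only the meaning of the coefficients changes.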

The Notes

The p-value of a coefficient in a linear regression model tests whether there is a significant linear relationship between that predictor (in this example the day dummies Monday to Friday) and the target variable (Return/Earnings). If the p-value is low, the relationship is statistically significant: the data are hard to reconcile with a coefficient of zero. For a dummy variable, this means the average of the response variable on that day differs from the average on the reference day.

If the R2 score is close to 0, it means the model is unable to explain most of the variability of your data. In other words, the observed values of return/earnings are, on average, not close to the line of best fit, so individual predictions will be imprecise.
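One way to see why the two numbers can diverge is the simplified case of a regression on a single dummy (say only DFriday) with n observations. This is a simplification of the five-dummy model in the question, used only for illustration. In that case the t-statistic of the slope and R2 are tied together by a standard identity:

```latex
t^2 = (n - 2)\,\frac{R^2}{1 - R^2}
\qquad\Longleftrightarrow\qquad
R^2 = \frac{t^2}{t^2 + n - 2}
```

For a fixed R2, the t-statistic therefore grows with the sample size. As a hypothetical example, R2 = 0.001 with n = 20,002 observations gives t^2 ≈ 20, so t ≈ 4.5, which corresponds to a very small p-value even though the model explains only 0.1% of the variance.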

Let's say b5 has got a p-value close to zero. But R2 is zero as well. How can I interpret this result?

b5 having a p-value close to zero means Friday has a significant linear relationship with the target variable, earnings. This does not necessarily imply that we are able to predict individual values precisely.
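A small simulation can make this concrete. The sketch below uses only the Python standard library and invented numbers (a true Friday effect of 10, day-to-day noise with standard deviation 100, 10,000 weeks of data); none of these values come from the question. It regresses earnings on a single Friday dummy, which reduces to comparing group means, and uses a normal approximation for the p-value, which is reasonable only because the sample is large.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

N_WEEKS = 10_000       # assumed sample size: 5 days per week
FRIDAY_EFFECT = 10.0   # assumed true extra earnings on Fridays
NOISE_SD = 100.0       # assumed daily noise, much larger than the effect

friday, other = [], []
for _ in range(N_WEEKS):
    for day in range(5):  # 0..3 = Monday..Thursday, 4 = Friday
        y = 1000.0 + random.gauss(0.0, NOISE_SD)
        if day == 4:
            friday.append(y + FRIDAY_EFFECT)
        else:
            other.append(y)

n1, n0 = len(friday), len(other)
n = n1 + n0
mean1 = sum(friday) / n1
mean0 = sum(other) / n0
grand_mean = (sum(friday) + sum(other)) / n

# OLS for y = a + b * D_Friday reduces to the two group means.
a = mean0           # intercept: average on non-Fridays
b = mean1 - mean0   # slope: Friday average minus non-Friday average

# R2 = 1 - residual sum of squares / total sum of squares.
ss_res = sum((y - mean1) ** 2 for y in friday) + sum((y - mean0) ** 2 for y in other)
ss_tot = sum((y - grand_mean) ** 2 for y in friday + other)
r2 = 1.0 - ss_res / ss_tot

# Standard error of b and its t-statistic.
se_b = math.sqrt(ss_res / (n - 2) * (1.0 / n1 + 1.0 / n0))
t_stat = b / se_b

# Two-sided p-value via the normal approximation (adequate for large n).
p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))

print(f"Friday coefficient b = {b:.2f}")
print(f"t-statistic = {t_stat:.2f}, approx. p-value = {p_value:.2e}")
print(f"R2 = {r2:.5f}")
```

Under these assumptions the Friday coefficient should come out highly significant while R2 stays far below 1%: the average Friday premium is estimated precisely because there is a lot of data, but knowing the weekday barely narrows down what any single day's earnings will be.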

For example, refer to the images below collected from here

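[Two images: scatter plots with a line of best fit, the first with points widely scattered around the line, the second with points tightly clustered around it]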

We can clearly see that, though there is a significant linear correlation between the variables in both plots, the R2 value is low for the plot with points scattered widely around the line of best fit compared to the second plot, where the points are more tightly centred around it.

  • Thank you very much for your response. I should have been more precise, sorry for that. R2 is very close to 0, not even 0,00001, but below. Thus the model has no explanatory power. But the p-value is way below 0,001. Why is R2 not higher, if the dummy is that significant? Or why is the model useful? Did I prove the dummy is useful although it does not help to predict anything? In the end it just says that Friday is on average higher than other days, but it can't say by which value, right? – Poldi Nov 06 '20 at 12:45
  • If the p-value for Friday is significant (the p-value is very low, close to 0), it means we are very confident that on Fridays earnings are, on average, higher or lower by about the amount of Friday's coefficient (depending on whether the coefficient is positive or negative) – Anthony Inthavong Nov 06 '20 at 15:17
  • R2 is not higher most likely because the other days do not help precisely (close to the line of best fit) predict the resulting target variable (earnings) – Anthony Inthavong Nov 06 '20 at 15:18
  • The dummy being significant doesn't imply R2 will be high, just that it follows a linear trend – Anthony Inthavong Nov 06 '20 at 15:19
  • The model is useful in the inferential sense, because you can say which variables have a linear trend or correlation with the target variable. In this case, because Friday is significant (very low p-value), you can say that when Friday is active, the target value is likely to increase or decrease on average by Friday's coefficient – Anthony Inthavong Nov 06 '20 at 15:21
  • But as a prediction model, since it is not very precise having a very low R2 score (close to 0) it would not be recommended to be used to predict future or potential earnings if you needed a very precise answer – Anthony Inthavong Nov 06 '20 at 15:22
  • Cool, thank you very much Anthony. In other words, just to see if I understood you correctly: I can say that 1) the model does not predict the earnings at all but 2) due to the significant Friday I can say that a Friday is on average higher than other days by the amount of the coefficient. We just don't know the absolute value. Right? – Poldi Nov 06 '20 at 21:13
  • Hi Poldi, yes that's correct. 1) The model should not be used as a prediction model because it does not accurately predict the earnings for the given days and 2) Friday having a low p-value suggests earnings will increase by Friday's coefficient on average. But I am unsure what you mean by absolute value. – Anthony Inthavong Nov 06 '20 at 21:36
  • You're my hero, thx Anthony. I just meant that we don't know anything about the earnings as a total. We just know that Friday is higher by the coefficient on average. On the other hand, the intercept reflects the normal day if I delete Monday in the regression equation, I guess. Thus we have the total average as well by adding the intercept and the coefficient, I think. But maybe I am wrong. – Poldi Nov 06 '20 at 22:49