
I have a data set consisting of about 10 independent variables (1000 rows x 10 columns).

I know that all of them should have a positive contribution to my target variable.

When I run a multivariate linear regression on this, I get some negative coefficients. Does this mean those attributes are supposedly making a negative contribution, and that my model is therefore incorrect (since they should all contribute positively)?

Any help appreciated. Thanks, J

J. Warrington

4 Answers


First, question how you know that the variables all make positive contributions. How do you support that statement? Second, how did you determine that the 10 variables are statistically independent?

If they are not truly independent, then it's possible to see this apparent contradiction. Although each of the ten may have a positive contribution, it's easy to build a case in which a combination over-contributes.

Consider a, b, and c, where a and c have a light positive correlation with each other, and b has a higher correlation with each. If any one of them increases, the output increases. However, if all three increase, it's quite possible that a simple linear combination will increase too much from a and c both rising; since b rises along with both of them, giving b a negative coefficient can balance that over-contribution. In other terms, since the "winning team" is far too strong, b defects to the opponents to keep the game properly balanced. :-)
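
To make that concrete, here is a minimal sketch in Python (toy variables a, b, and c of my own construction, not the asker's data): every predictor correlates positively with the target, yet ordinary least squares gives b a negative coefficient, because b is used to cancel the shared over-contribution of a and c.

```python
# Minimal sketch of the "defecting teammate" effect (toy data, not the
# asker's): all three predictors correlate positively with the target,
# yet OLS assigns b a negative coefficient.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000

a = rng.normal(size=n)
c = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

y = a + c + noise   # a and c each push the target up
b = a + c - noise   # b rises with a and c, so it also correlates with y

X = np.column_stack([a, b, c])

# Every predictor has a positive correlation with the target...
print([round(float(np.corrcoef(X[:, j], y)[0, 1]), 2) for j in range(3)])

# ...but the fitted coefficients come out near [2, -1, 2]: b "defects"
# to offset the over-contribution of a and c.
print(LinearRegression().fit(X, y).coef_)
```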

Does that clarify the problem? Does it match the problem?

Prune

Your model is fine. It can have negative weights. The weights are relative contributions: they show how much effect one feature has compared to the others.

A negative constant (intercept) should not be a problem either. It means that the expected value of your dependent variable would be less than 0 when all independent features are set to 0. For some correlated features that is exactly what you would expect: for example, if the mean value of your features is negative, a negative constant is natural; on the contrary, a positive value there would be the problematic one.

Even when the dependent variable is always positive, the constant can still come out negative. For example, consider an independent feature that has a strongly positive correlation with the dependent feature:

The values of the dependent feature are positive, in the range 1-10;
the values of the independent feature are positive, in the range 200-210.

In this case the regression line crosses the x-axis somewhere between x = 0 and x = 200, which results in a negative value for the constant; i.e., the regression line passes from the fourth quadrant into the first.
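
As a quick illustration (a made-up single-feature example, not from the answer itself), fitting data in exactly these ranges produces a large negative intercept:

```python
# Sketch: strictly positive data can still yield a negative constant.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(200, 210, 50).reshape(-1, 1)  # independent feature: 200-210
y = 1 + 0.9 * (x.ravel() - 200)               # dependent feature: about 1-10

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)  # slope ~0.9, intercept ~-179
```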

saurabh agarwal

The most likely cause is correlation between the variables, due to the limited sample size and noise in the system. Only with infinite data would the estimated correlation between truly independent variables come out to exactly zero; the smaller the sample size, the larger the error in estimating the correlation.

1) Try calculating the correlations of the variables on your 1000 examples (see the sketch below). 2) My intuition is that your negative weights should be fairly small compared to the positive ones, and as the sample size increases, the likelihood of a negative weight decreases.
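
For step 1, a minimal sketch (the DataFrame and its columns are stand-ins for the asker's data): even for variables generated independently, the sample correlations will not be exactly zero.

```python
# Sketch: pairwise correlations among predictors are noisy at n = 1000.
# The DataFrame below is stand-in data, not the asker's.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 10)),
                  columns=[f"x{i}" for i in range(10)])

# Off-diagonal entries hover around roughly +/-0.03, not exactly zero,
# even though the columns were generated independently.
print(df.corr().round(2))
```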

Just curious: what are your 10 variables, and how did you judge that they are independent?


This happened to me. I had positive correlations but negative weights in a linear regression, with no apparent explanation: the data showed no collinearity, and the result could not be rationalized. It simply didn't make sense.

In my case, the cause was a messed-up Pandas DataFrame index. After I applied df.reset_index(), the variables behaved as expected and the problem was solved.
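
For what it's worth, here is a minimal sketch of that class of failure, assuming pandas index alignment was involved (the exact pipeline behind the answer may have differed):

```python
# Sketch: a scrambled DataFrame index can silently misalign rows once
# the data is converted to plain arrays for a regression.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.normal(size=200)})
y = 2 * X["x"]                                 # target perfectly tracks x

X_shuffled = X.sample(frac=1, random_state=1)  # rows reordered, index scrambled

# pandas still aligns on the index, so this difference is exactly zero:
print(float((y - 2 * X_shuffled["x"]).abs().max()))   # 0.0

# ...but converting to arrays drops the index, so rows no longer match:
print(np.corrcoef(X_shuffled["x"].to_numpy(), y.to_numpy())[0, 1])

# Restoring a clean positional order (e.g., sort_index, or reset_index
# on consistently ordered frames) brings the correlation back to 1:
X_fixed = X_shuffled.sort_index()
print(np.corrcoef(X_fixed["x"].to_numpy(), y.to_numpy())[0, 1])  # ~1.0
```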

razimbres