
I have a dataset that contains a categorical attribute, state, which can take the values New York, California, and Florida.

  • After encoding these values as dummy variables, why do we need to drop one variable?
  • Can someone explain what the dummy variable trap situation is in linear regression?
  • Why do we need to drop one variable to get out of this situation?
Chirag Jain
  • I need a theoretical explanation. Why do we need to drop one variable? – Chirag Jain Mar 10 '18 at 17:28
  • It's called the dummy variable trap: a scenario in which the independent variables are multicollinear - that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. Therefore one variable is dropped. (taken from here: https://www.algosome.com/articles/dummy-variable-trap-regression.html) – msarafzadeh Mar 29 '19 at 11:21

2 Answers


This is not always necessary, but the idea is that if the categorical attribute covers the whole space (i.e. your dummy variables represent all the possible values of the attribute), then the last dummy variable can be perfectly predicted from the other N-1 dummies:

last_dummy = 1 if sum(dummies[:N-1]) == 0 else 0

This introduces heavy collinearity between your dummy variables (which is very undesirable in linear/logistic regression), and that's why it is called the dummy variable trap.

Usually, the way to fix this problem is to just remove one dummy column (any will do; it does not have to be the last one). This removes the source of collinearity and, since the dropped dummy could be predicted from the rest anyway, there is no loss of information from the original dataset.
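
For illustration, here is a minimal sketch using pandas (the state values are taken from the question; the column name and sample rows are hypothetical):

import pandas as pd

# Hypothetical data with the 'state' attribute from the question
df = pd.DataFrame({'state': ['New York', 'California', 'Florida', 'New York']})

# drop_first=True removes one dummy column, avoiding the trap
dummies = pd.get_dummies(df['state'], drop_first=True)
print(dummies)

# The dropped category ('California', the first alphabetically) is still
# encoded implicitly: a row with all dummies equal to zero (False) means
# state == 'California', so no information is lost.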

carrdelling
  • Thank you. So, if I include all the dummy variables in the equation, is that in any way related to the constant term? I read somewhere that the constant term and all the dummy variables can't be included together in a linear equation. – Chirag Jain Mar 11 '18 at 04:24
  • No, not really. The constant term (the bias) should be independent of the rest of your variables. The problem here is that each dummy variable can be predicted from the others, so you need to take one out to "break the loop". – carrdelling Mar 11 '18 at 10:18

You always need to drop one dummy variable per categorical attribute because of the intercept. Say you have 7 dummy variables for the day of the week: you drop one of them, and that level (e.g. Monday) becomes the reference against which the others are compared.

If you remove the intercept, then you can keep Monday. But removing the intercept is done only in very specific cases.
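
For illustration, a minimal NumPy sketch (the design matrix below is a hypothetical example using the three states from the question) showing why the intercept and a full set of dummies cannot coexist:

import numpy as np

# Hypothetical design matrix: intercept column plus all three state dummies
X = np.array([
    [1, 1, 0, 0],  # New York
    [1, 0, 1, 0],  # California
    [1, 0, 0, 1],  # Florida
    [1, 1, 0, 0],  # New York
])

# The three dummy columns sum exactly to the intercept column, so the matrix
# is rank-deficient and ordinary least squares has no unique solution
print(np.linalg.matrix_rank(X))         # 3, although there are 4 columns

# Dropping one dummy column restores full column rank
print(np.linalg.matrix_rank(X[:, :3]))  # 3 == number of remaining columns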

Neon67
  • Why can't we have all the variables along with the intercept? Is there any specific reason for that? – Chirag Jain Mar 11 '18 at 09:11
  • 2
    The model is unsolvable when you have perfect collinearity. The matrix is singular/degenerate when the intercept is perfectly collinear with the sum of the categorical variables. – Darren Nov 06 '18 at 21:32