
I have a regression problem at hand and I know that the error is caused by multicollinearity of the input variables, but I am having a hard time figuring out how to identify the redundant variables and remove them from the regression model to make it work.

Here is a simple example, but in my case it is more than just highly correlated variables (there can be an exact linear dependence, e.g. x3 = x1 + x2):

import numpy as np

# input array A
A = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 8, 10, 12]])

# output array b
b = np.array([22., 7., 14.])

# check the rank of the input array and find that it is not full rank (2 instead of 3)
np.linalg.matrix_rank(A)

# will raise "LinAlgError: Singular matrix"
np.linalg.solve(A, b)

# will return a (minimum-norm) least-squares result without error
np.linalg.lstsq(A, b, rcond=None)

However, in this case I would like to first remove x3, since it is linearly dependent on the other columns, and then fit b using only x1 and x2.
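Done by hand for this toy example (continuing with the A and b defined above), that would look something like this:

# drop the third column by hand and fit the remaining two against b
A_reduced = A[:, [0, 1]]   # keep x1 and x2 only
coef, residuals, rank, sv = np.linalg.lstsq(A_reduced, b, rcond=None)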

Any idea on how to automatically detect these redundant columns and remove them, so that A has full column rank before it is fit into the model?
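For what it is worth, the closest I have got to doing this automatically is the brute-force rank check below (just a sketch, and I suspect it is not robust when columns are nearly collinear rather than exactly dependent), so I am hoping there is a more standard approach:

# keep a column only if it raises the rank of the columns kept so far;
# any column that gets skipped is a linear combination of the kept ones
keep = []
for j in range(A.shape[1]):
    if np.linalg.matrix_rank(A[:, keep + [j]]) == len(keep) + 1:
        keep.append(j)

# for the A above, keep ends up as [0, 1], i.e. x3 is flagged as redundant
A_full_rank = A[:, keep]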

Thanks!

Kexin Xu