
Is there a way to check for linear dependence among the columns of a pandas DataFrame? For example:

import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3, 4]})
df['B'] = df['A'] * 2  # B is an exact multiple of A
df['C'] = [8, 3, 5, 4]
print(df)

   A  B  C
0  0  0  8
1  2  4  3
2  3  6  5
3  4  8  4

Is there a way to show that column B is a linear combination of A, but C is an independent column? My ultimate goal is to run a Poisson regression on a dataset, but I keep getting a LinAlgError: Singular matrix error, which suggests that the matrix built from my dataframe has no inverse and therefore that the dataframe contains linearly dependent columns.

I would like to come up with a programmatic way to check each feature and ensure there are no dependent columns.

MSeifert
Geoff Perrin
  • You should be able to achieve what you need with `numpy` and this post: https://stackoverflow.com/questions/28816627/how-to-find-linearly-independent-rows-from-a-matrix – Easton Bornemeier Jun 14 '17 at 22:38

1 Answer


If you have SymPy, you can compute the "reduced row echelon form" with sympy.Matrix.rref:

>>> import sympy 
>>> reduced_form, inds = sympy.Matrix(df.values).rref()
>>> reduced_form
Matrix([
[1.0, 2.0,   0],
[  0,   0, 1.0],
[  0,   0,   0],
[  0,   0,   0]])

>>> inds
[0, 2]

The pivot columns (stored as inds) are the positions of the linearly independent columns, so you can simply slice away the others:

>>> df.iloc[:, list(inds)]  # list() also covers SymPy versions that return the pivots as a tuple
   A  C
0  0  8
1  2  3
2  3  5
3  4  4
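
If SymPy is not available, or the data are floating-point values where exact row reduction is fragile, a similar check can be sketched with NumPy's numerical rank. This is only an illustration, not part of the approach above: the helper name `split_columns_by_dependence` and its `tol` parameter are made up for this example. It scans the columns left to right and treats a column as dependent when adding it to the columns kept so far does not raise the rank.

```python
import numpy as np
import pandas as pd


def split_columns_by_dependence(frame, tol=None):
    """Return (independent, dependent) column labels, scanning left to right.

    Hypothetical helper: a column counts as dependent when adding it to the
    columns kept so far does not raise the numerical rank of the matrix.
    """
    values = frame.to_numpy(dtype=float)
    independent, dependent = [], []
    kept = []  # positional indices of the columns kept so far
    for position, label in enumerate(frame.columns):
        candidate = values[:, kept + [position]]
        if np.linalg.matrix_rank(candidate, tol=tol) > len(kept):
            kept.append(position)
            independent.append(label)
        else:
            dependent.append(label)
    return independent, dependent


# Same example data as in the question.
df = pd.DataFrame({"A": [0, 2, 3, 4]})
df["B"] = df["A"] * 2  # B is an exact multiple of A
df["C"] = [8, 3, 5, 4]

independent, dependent = split_columns_by_dependence(df)
print(independent)  # ['A', 'C']
print(dependent)    # ['B']
```

Which column of a dependent group gets flagged depends on the column order, since the first one encountered is kept. For noisy real-world features, `tol` may need tuning, because nearly collinear columns can still count as full rank with the default tolerance.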
MSeifert