Yes, it will matter. sklearn converts the pandas DataFrame to an array of values (essentially calling X1.values) and matches features by position; it will not reorder columns by name for you. However, it's an easy fix. Just use:
X2 = X2[X1.columns]
This reorders X2's columns to match the column order of X1.
The same is true of numpy arrays, of course: the model is fit on the columns in the order they appear in X1, so when you predict on X2, it reads X2's columns in that same order, whatever they actually contain.
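To make the positional behaviour concrete, here is a minimal sketch that uses a plain least-squares fit in numpy as a stand-in for an sklearn estimator. The target `y = a + 10*b` and the `predict` helper are made up for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd

X1 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
X2 = pd.DataFrame({"b": [5, 4, 6], "a": [3, 2, 1]})

# Hypothetical target for illustration: y = 1*a + 10*b
y = X1["a"] + 10 * X1["b"]

# "Fit" on the raw values; the column names are gone at this point,
# so the weights are tied to column positions only.
weights, *_ = np.linalg.lstsq(
    X1.values.astype(float), y.values.astype(float), rcond=None
)

def predict(values):
    # Purely positional: column 0 gets weights[0], column 1 gets weights[1].
    return values @ weights

wrong = predict(X2.values)              # columns are (b, a) but read as (a, b)
right = predict(X2[X1.columns].values)  # columns reordered to (a, b) first

print(wrong)
print(right)
```

Here `right` comes out at approximately 53, 42, 61, which is what `a + 10*b` gives for the rows of X2, while `wrong` comes out at approximately 35, 24, 16 because the two columns were read in swapped positions.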
Example:
Take these 2 dataframes:
>>> X1
   a  b
0  1  5
1  2  6
2  3  7
>>> X2
   b  a
0  5  3
1  4  2
2  6  1
The model is fit on X1.values:
>>> X1.values
array([[1, 5],
       [2, 6],
       [3, 7]])
And you predict on X2.values:
>>> X2.values
array([[5, 3],
       [4, 2],
       [6, 1]])
The fitted model only sees the values by position, so it will not swap the columns back for you. Switch them manually:
>>> X2 = X2[X1.columns]
>>> X2
   a  b
0  3  5
1  2  4
2  1  6
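One caveat with `X2[X1.columns]`: it raises a KeyError if X2 lacks one of X1's columns, and it silently drops any extra columns that X2 has. If you would rather have an explicit check, something along these lines may help. The `align_columns` helper is a hypothetical name for this sketch, not a pandas or sklearn function.

```python
import pandas as pd

def align_columns(reference: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Return `other` with its columns in the same order as `reference`.

    Hypothetical helper: raises if the two frames do not have
    exactly the same set of column names.
    """
    missing = [c for c in reference.columns if c not in other.columns]
    extra = [c for c in other.columns if c not in reference.columns]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return other[reference.columns]

X1 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
X2 = pd.DataFrame({"b": [5, 4, 6], "a": [3, 2, 1]})

X2_aligned = align_columns(X1, X2)
print(X2_aligned)
```

The result has the same contents as `X2[X1.columns]` above; the only difference is that a mismatch in column names is reported up front instead of surfacing later as a confusing prediction.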