60

I have a pandas DataFrame with a header for each column. As long as I stick to pandas operations for data manipulation, the column headers are retained. But as soon as I run the data through one of scikit-learn's pre-processing functions, I lose all the headers and the frame is converted to a plain matrix of numbers.

I understand why this happens: scikit-learn returns a NumPy ndarray as output, and an ndarray, being just a matrix, carries no column names.

But here is the thing. When building a model on my dataset, I often need to do further data manipulation after the initial pre-processing, for example to try another model for a better fit. Without access to the column headers this becomes difficult, since I may not know the index of a particular variable, whereas remembering the variable name (or looking it up with df.columns) is much easier.

How can I overcome this?

EDIT1: Editing with sample data snapshot.

    Pclass  Sex Age SibSp   Parch   Fare    Embarked
0   3   0   22  1   0   7.2500  1
1   1   1   38  1   0   71.2833 2
2   3   1   26  0   0   7.9250  1
3   1   1   35  1   0   53.1000 1
4   3   0   35  0   0   8.0500  1
5   3   0   NaN 0   0   8.4583  3
6   1   0   54  0   0   51.8625 1
7   3   0   2   3   1   21.0750 1
8   3   1   27  0   2   11.1333 1
9   2   1   14  1   0   30.0708 2
10  3   1   4   1   1   16.7000 1
11  1   1   58  0   0   26.5500 1
12  3   0   20  0   0   8.0500  1
13  3   0   39  1   5   31.2750 1
14  3   1   14  0   0   7.8542  1
15  2   1   55  0   0   16.0000 1

The above is the pandas DataFrame. Now when I run the following on it, the column headers are stripped.

from sklearn import preprocessing
X_imputed = preprocessing.Imputer().fit_transform(X_train)
X_imputed

The result is a NumPy array, so the column names are gone:

array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
          7.25      ,   1.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
         71.2833    ,   2.        ],
       [  3.        ,   1.        ,  26.        , ...,   0.        ,
          7.925     ,   1.        ],
       ..., 
       [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
         23.45      ,   1.        ],
       [  1.        ,   0.        ,  26.        , ...,   0.        ,
         30.        ,   2.        ],
       [  3.        ,   0.        ,  32.        , ...,   0.        ,
          7.75      ,   3.        ]])

So: I want to retain the column names when I run scikit-learn transformations on my pandas DataFrame.

Baktaawar
  • A sample of the Pandas code might be more useful. Doesn't Pandas provide a way of extracting the data from a frame, and then replace it with a new copy? – hpaulj Apr 12 '15 at 16:22
  • 3
    @Manish : please provide a very simple, reproducible example! A three row dataframe would make your question more understandable. (Maybe just copying `saved_cols = df.columns` and then reassigning it to the modified `df` would do the trick, but I'm not sure that's what you need) – cd98 Apr 13 '15 at 03:17
  • 2
    Indeed, as @cd98 says, if you save `saved_cols = df.columns` and then call `pandas.DataFrame(data, columns=saved_cols)` on the result, you get your dataframe back. I do this for example with `train_test_split`, which returns a `numpy ndarray` that I need to use as a dataframe. It is not something to be particularly proud of, but in my opinion it is good enough. – lrnzcig Apr 13 '15 at 09:31
  • 2
    @lrnzcig which version is that? I thought that worked for train_test_split in 0.16. – Andreas Mueller Apr 13 '15 at 23:36
  • @AndreasMueller indeed I've upgraded to 0.16 and no need to do it anymore for train_test_split. Thanks. – lrnzcig Apr 14 '15 at 09:22
  • Could someone help on this? How to retain header names of dataset when run through scikit learn? – Baktaawar Apr 19 '15 at 04:12

5 Answers

81

scikit-learn indeed strips the column headers in most cases, so just add them back afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original DataFrame, you can put the column headers back on with:

X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns)
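A self-contained sketch of that round trip (the small frame below is made up for illustration; note that the original row index is also lost in the ndarray and is worth restoring alongside the column names):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: one missing Age value
X_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                        "Fare": [7.25, 71.2833, 7.925]},
                       index=[10, 11, 12])

# Stand-in for the imputer output: a plain ndarray, names and index gone
X_imputed = X_train.fillna(X_train.mean()).to_numpy()

# Re-attach the column names -- and the row index, which is lost too
X_imputed_df = pd.DataFrame(X_imputed,
                            columns=X_train.columns,
                            index=X_train.index)
```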
selwyth
  • Thank u very much for this answer ... I was stuck on the same issue and your answer solved my problem. – gaurus May 07 '16 at 14:51
  • 30
    What if my preprocessing step is feature selection? Say I have 1000 columns and after preprocessing (sklearn.feature_selection.SelectPercentile) it returns only 100. How will I know which columns were removed and which were kept? – Supreeth Meka Aug 27 '16 at 04:08
  • 1
    @SupreethMeka did you ever figure this out? – Drew Szurko Jun 15 '17 at 18:47
  • Appreciate this! –  Sep 24 '17 at 21:21
  • 13
    Use the [get_support method](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile.get_support). `X_selected_df = pd.DataFrame(X_selected, columns=[X_train.columns[i] for i in range(len(X_train.columns)) if feature_selector.get_support()[i]])` – selwyth Oct 19 '17 at 22:53
  • 3
  • You can also add the index. `pd.DataFrame(data=transformed_data, columns=train_data.columns, index=train_data.index)` – negas Mar 08 '19 at 17:22
  • This will NOT work if you are using an imputer, which rearranges columns after transformation (so different order than X_train.columns). Ouch!! – Paul Jun 23 '22 at 05:19
6

The above answers still do not fully resolve the question. They make two implicit assumptions:

  1. That all features of the dataset are retained, which may not be true, e.g. after some kind of feature-selection function.
  2. That all features are retained in the same order; some feature-selection transformations may implicitly reorder them.

At least some fit/transform classes (the feature selectors in particular) provide a get_support() method, which records which columns (features) were retained and in what order.

See the get_support() description in the scikit-learn documentation for the basics of the method and how to use it.

This is the most robust and official way to get the information needed here.
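A minimal sketch of that approach, using SelectPercentile as a stand-in feature selector (the frame and target below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

# Hypothetical frame: four features, the target depends on only two of them
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 4), columns=["a", "b", "c", "d"])
y = (X["a"] + X["c"] > 1.0).astype(int)

# Keep the top 50% of features (2 of 4)
selector = SelectPercentile(f_classif, percentile=50).fit(X, y)

# get_support() is a boolean mask over the *input* columns,
# in the original column order
mask = selector.get_support()
kept = X.columns[mask]

X_selected = pd.DataFrame(selector.transform(X), columns=kept, index=X.index)
```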

5

As Ami Tavory notes in his reply, per the documentation, Imputer omits columns or rows that are entirely empty (whichever axis you run it on).
So before running the Imputer and restoring the column names as described above, drop the all-empty columns yourself:

X_train = X_train.dropna(axis=1, how='all')

See the pandas documentation for df.dropna.
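A small sketch of why this matters, using SimpleImputer (the modern replacement for Imputer, which also drops all-empty columns) and a made-up frame with one entirely empty column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # modern replacement for Imputer

# Hypothetical frame where one column is entirely empty
X_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                        "Cabin": [np.nan, np.nan, np.nan],
                        "Fare": [7.25, 71.2833, 7.925]})

# Drop the all-empty columns first, so the imputer output
# has the same number of columns as X_train.columns
X_train = X_train.dropna(axis=1, how='all')

imputer = SimpleImputer()
X_imputed = pd.DataFrame(imputer.fit_transform(X_train),
                         columns=X_train.columns)
```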

AChervony
  • I assume that your suggestion is to do this also in the `predict` stage. this would cause a bug if in the `predict` these columns are not empty – ihadanny Jan 01 '20 at 13:46
  • better use this: `selected_cols[~pd.isnull(self.model_.steps[0][1].statistics_)]` – ihadanny Jan 02 '20 at 06:52
1

Some scikit-learn transformers, such as PolynomialFeatures, have a get_feature_names() method that accepts the original column names:

from sklearn import preprocessing as pp

poly = pp.PolynomialFeatures(3, interaction_only=False, include_bias=False)
poly.fit(X_train)

X_test_new = pd.DataFrame(poly.transform(X_test),
                          columns=poly.get_feature_names(X_test.columns))
X_test_new.head()
Jane Kathambi
0

Adapted from part of the intermediate machine learning course on Kaggle:

import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X = pd.DataFrame(my_imputer.fit_transform(X))

# Imputation removed column names; put them back
imputed_X.columns = X.columns