How to Combine Numeric and Categorical features in scikit-learn Pipelines?

Question

I am running into issues trying to combine numeric (continuous) features with factors. I am using pandas DataFrames as input to the model. Right now, my code works with factors like 'gender', which can be easily transformed using built-in transformers:

('gender', Pipeline([
('selector', ColumnSelector(column='gender')),
('dict', DictTransformer()),
('vect', DictVectorizer(sparse=False))
]))

But when I try to combine that with a numeric feature (for example, latitude) as follows,

('latitude', Pipeline([
('selector', ColumnSelector(column='latitude')),
('scaler', StandardScaler())
]))

I get an error:

ValueError: all the input arrays must have same number of dimensions

Here is my code for ColumnSelector():

class ColumnSelector(TransformerMixin):
    """
    Class for building sklearn Pipeline step. This class should be used to select a column from a pandas data frame.
    """

    def __init__(self, column):
        self.column = column

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[self.column]

Obviously I'm missing something important here. Any ideas?

Can you get the sequence of transformations to work outside the context of a pipeline? The pipeline may be making troubleshooting more difficult — Ryan, Dec 15 '15 at 18:37
I'm sure I can do that, and I may have to, but it doesn't solve this question. The pipeline combined with FeatureUnion seems really convenient, so I want to figure out how to make this work. — Evan Zamir, Dec 15 '15 at 18:44
Where exactly does this error come from? File/line number. Also, if you want to transform each column with a different transformer, look at http://stackoverflow.com/a/34202758/1030820 — Ibraim Ganiev, Dec 15 '15 at 19:48

score 2 · Accepted Answer · answered Dec 16 '15 at 05:22

Using Pipelines within FeatureUnion should work. The problem here is likely related to the implementation of ColumnSelector. Notice that its transform returns a one-dimensional structure (data_frame[self.column] is a pandas Series); however, interfaces in scikit-learn generally expect 2D input of shape (n_samples, n_features).

Assuming the input to ColumnSelector is a pandas DataFrame, try changing the code to:

class ColumnSelector(TransformerMixin):
   ...

   def transform(self, data_frame):
       return data_frame[[self.column]]

which gives the transformed output a 2D shape: indexing with a list of column names returns a DataFrame of shape (n_samples, 1) rather than a Series.
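
The difference between the two indexing forms can be checked directly. This is a minimal sketch, assuming a small made-up DataFrame (the values are illustrative and not taken from the question):

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "latitude": [37.77, 40.71, 34.05, 41.88],
})

one_d = df["latitude"]    # single label -> pandas Series
two_d = df[["latitude"]]  # list of labels -> pandas DataFrame

print(type(one_d).__name__, one_d.shape)  # Series (4,)
print(type(two_d).__name__, two_d.shape)  # DataFrame (4, 1)
```

Only the second form has the (n_samples, n_features) layout that downstream steps expect.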

Internally, FeatureUnion uses hstack to combine the feature matrices. Here is a minimal example that makes hstack complain about a dimension mismatch in the way described in the question:

import numpy as np
a = np.array([[1,0],
              [0,1]])
b = np.array([2,3])
print(np.hstack((a, b)))
# ValueError: all the input arrays must have same number of dimensions

However, this works:

print(np.hstack((a, b[:, np.newaxis])))
# [[1 0 2]
#  [0 1 3]]

since now b[:, np.newaxis] has two dimensions.
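
Putting the pieces together, here is a sketch of the whole FeatureUnion with the corrected selector. It rests on several assumptions: the DataFrame is made up, and DictTransformer is a hypothetical stand-in, since the question does not show its implementation (here it simply turns each row into a dict). Inheriting from BaseEstimator is also an addition that is not in the original class.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select one column from a DataFrame, keeping a 2D shape."""

    def __init__(self, column):
        self.column = column

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        # Double brackets return a DataFrame of shape (n_samples, 1).
        return data_frame[[self.column]]


class DictTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in; the question's DictTransformer is not shown."""

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        # One dict per row, e.g. {"gender": "F"}, as DictVectorizer expects.
        return data_frame.to_dict(orient="records")


# Made-up data for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "latitude": [37.77, 40.71, 34.05, 41.88],
})

union = FeatureUnion([
    ("gender", Pipeline([
        ("selector", ColumnSelector(column="gender")),
        ("dict", DictTransformer()),
        ("vect", DictVectorizer(sparse=False)),
    ])),
    ("latitude", Pipeline([
        ("selector", ColumnSelector(column="latitude")),
        ("scaler", StandardScaler()),
    ])),
])

features = union.fit_transform(df)
print(features.shape)  # (4, 3): two one-hot gender columns plus scaled latitude
```

Both branches now produce 2D arrays with the same number of rows, so the horizontal stacking inside FeatureUnion succeeds.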

Yep, this was the issue and the fix. Thanks! — Evan Zamir, Dec 16 '15 at 05:58
