
I just started using sklearn and I want to categorize products. The products appear on order lines and have properties like a description, a price, a manufacturer, an order quantity, etc. Some of these properties are text and others are numbers (integers or floats). I want to use these properties to predict whether a product needs maintenance. Products we buy can be things like engines, pumps, etc. but also nuts, hoses, filters, etc. So far I made one prediction based on the price and quantity, and separate predictions based on the description or manufacturer. Now I want to combine these predictions, but I'm not sure how to do that. I've seen the Pipeline and FeatureUnion pages, but they are confusing to me. Does anybody have a simple example of how to predict on data that has both text and number columns at the same time?

I now have:

order_lines.head(5)

   Part No  Part Description                        Quantity  Price/Base  Supplier Name                       Purch UoM  Category
0  1112165  Duikwerkzaamheden                       1.0       750.00     Duik & Bergingsbedrijf Europa B.V.  pcs        0
1  1112165  Duikwerkzaamheden bij de helling        1.0       500.00     Duik & Bergingsbedrijf Europa B.V.  pcs        0
2  1070285  Inspectie boegschroef, dd. 26-03-2012   1.0       0.01       Duik & Bergingsbedrijf Europa B.V.  pcs        0
3  1037024  Spare parts Albanie Acc. List           1.0       3809.16    Lastechniek Europa B.V.             -          0
4  1037025  M_PO:441.35/BW_INV:0                    1.0       0.00       Exalto                              pcs        0

category_column = order_lines['Category']
order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]

from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(order_lines, category_column, test_size=0.20)

from sklearn.base import TransformerMixin, BaseEstimator

class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    FEATURE_TYPES = {
        'price and quantity': [
            'Price/Base',
            'Quantity',
        ],
        'description, supplier, uom': [
            'Part Description',
            'Supplier Name',
            'Purch UoM',
        ],
    }
    def __init__(self, feature_type):
        self.columns = self.FEATURE_TYPES[feature_type]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import RobustScaler

preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('price and quantity'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('description, supplier, uom'),
        CountVectorizer(),
    ),
)
preprocessor.fit_transform(features_train)

And then I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-f8b0db33462a> in <module>()
----> 1 preprocessor.fit_transform(features_train)

C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    500         self._update_transformer_list(transformers)
    501         if any(sparse.issparse(f) for f in Xs):
--> 502             Xs = sparse.hstack(Xs).tocsr()
    503         else:
    504             Xs = np.hstack(Xs)

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    462 
    463     """
--> 464     return bmat([blocks], format=format, dtype=dtype)
    465 
    466 

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    579                 else:
    580                     if brow_lengths[i] != A.shape[0]:
--> 581                         raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
    582 
    583                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions
Martijn de Munnik
  • just answered a very similar question. Does this help you? http://stackoverflow.com/questions/39001956/sklearn-pipeline-transformation-on-only-certain-features – maxymoo Aug 18 '16 at 02:39

1 Answer


I would suggest not making separate predictions per feature type and then combining them. You're better off using FeatureUnion as you suggest, which lets you build a separate preprocessing pipeline for each feature type and feed the combined features to a single estimator. Incidentally, that is also where your error comes from: CountVectorizer expects a 1-D sequence of strings, but iterating over a DataFrame yields its column names, so your text branch produced one row per text column (three) while the numeric branch produced one row per sample, and the two blocks could not be stacked. A construction I often use is the following...

Let's define a toy example dataset to play around with:

import pandas as pd

# create a pandas dataframe that contains your features
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})

# create corresponding target (this is often just one of the dataframe columns)
y = pd.Series([0, 1, 1, 0], index=X.index)

I glue everything together using Pipeline and FeatureUnion (or rather their simpler shortcuts make_pipeline and make_union):

from sklearn.pipeline import make_union, make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# create your preprocessor that handles different feature types separately
preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('continuous'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('categorical'),
        RowToDictTransformer(),
        DictVectorizer(sparse=False),  # set sparse=True if you get MemoryError
    ),
)

# example use of your combined preprocessor
preprocessor.fit_transform(X)

# choose some estimator
estimator = LogisticRegression()

# your prediction model can be created as follows
model = make_pipeline(preprocessor, estimator)

# and training is done as follows
model.fit(X, y)

# predict (preferably not on training data X)
model.predict(X)
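
One nice side effect of keeping all preprocessing inside the pipeline is that cross-validation re-fits the scaler and the vectorizer on each training fold, so nothing leaks from the held-out data. A quick sanity check (a sketch, assuming scikit-learn >= 0.18 for the model_selection module; cv=2 only because the toy set has just four rows):

from sklearn.model_selection import cross_val_score

# the whole pipeline (preprocessing + estimator) is re-fit on each fold
print(cross_val_score(model, X, y, cv=2))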

Here, I defined my own custom transformers FeatureTypeSelector and RowToDictTransformer as follows:

from sklearn.base import TransformerMixin, BaseEstimator


class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    """ Selects a subset of features based on their type """

    FEATURE_TYPES = {
        'categorical': [
            'item_name',
            'item_type',
        ],
        'continuous': [
            'quantity',
            'item_price',
        ]
    }

    def __init__(self, feature_type):
        self.columns = self.FEATURE_TYPES[feature_type]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class RowToDictTransformer(TransformerMixin, BaseEstimator):
    """ Prepare dataframe for DictVectorizer """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # DictVectorizer expects an iterable of dict-like records, one per row
        return (row.to_dict() for _, row in X.iterrows())

Hope this example paints a somewhat clearer picture of how to do feature union.
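
For completeness: if you are on a newer scikit-learn (0.20+), ColumnTransformer covers this column-routing pattern without the custom transformer classes. A minimal sketch on the toy data above, where OneHotEncoder takes over the role of RowToDictTransformer + DictVectorizer (a free-text column could instead be routed to a CountVectorizer, passed as a bare string column name so it receives a 1-D array of documents):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# route each group of columns to its own preprocessing step
preprocessor = ColumnTransformer([
    ('continuous', RobustScaler(), ['quantity', 'item_price']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['item_name', 'item_type']),
])

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)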

-Kris

  • Now I see you've removed the part about Dict – Martijn de Munnik Aug 18 '16 at 09:29
  • Your example gives this error: `TypeError: fit_transform() takes 2 positional arguments but 3 were given` – Martijn de Munnik Aug 18 '16 at 09:33
  • Upvote for showing a workable example. Just my 2 cents: for categorical data with a limited vocabulary (like countries or genders), one-hot encoding will help. For text data like "Part Description", an average word vector is not a bad idea. Both cases give a fixed-length vector, which can be used jointly with other features. – Steven Du Aug 19 '16 at 07:14