I just started using sklearn and I want to categorize products. The products appear on order lines and have properties such as a description, a price, a manufacturer, an order quantity, etc. Some of these properties are text and others are numbers (integers or floats). I want to use these properties to predict whether a product needs maintenance. The products we buy can be things like engines and pumps, but also nuts, hoses, filters, etc. So far I have made one prediction based on the price and quantity, and separate predictions based on the description or the manufacturer. Now I want to combine these predictions, but I'm not sure how to do that. I've looked at the Pipeline and FeatureUnion pages, but I find them confusing. Does anybody have a simple example of how to predict from data that has both text and number columns at the same time?
I now have:
order_lines.head(5)
   Part No                       Part Description  Quantity  Price/Base                       Supplier Name  Purch UoM  Category
0  1112165                      Duikwerkzaamheden       1.0      750.00  Duik & Bergingsbedrijf Europa B.V.        pcs         0
1  1112165       Duikwerkzaamheden bij de helling       1.0      500.00  Duik & Bergingsbedrijf Europa B.V.        pcs         0
2  1070285  Inspectie boegschroef, dd. 26-03-2012       1.0        0.01  Duik & Bergingsbedrijf Europa B.V.        pcs         0
3  1037024          Spare parts Albanie Acc. List       1.0     3809.16             Lastechniek Europa B.V.          -         0
4  1037025                   M_PO:441.35/BW_INV:0       1.0        0.00                              Exalto        pcs         0
category_column = order_lines['Category']
order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was deprecated in 0.18 and later removed
features_train, features_test, target_train, target_test = train_test_split(order_lines, category_column, test_size=0.20)
from sklearn.base import TransformerMixin, BaseEstimator
class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    FEATURE_TYPES = {
        'price and quantity': [
            'Price/Base',
            'Quantity',
        ],
        'description, supplier, uom': [
            'Part Description',
            'Supplier Name',
            'Purch UoM',
        ],
    }

    def __init__(self, feature_type):
        self.columns = self.FEATURE_TYPES[feature_type]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import RobustScaler
preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('price and quantity'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('description, supplier, uom'),
        CountVectorizer(),
    ),
)
preprocessor.fit_transform(features_train)
And then I got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-f8b0db33462a> in <module>()
----> 1 preprocessor.fit_transform(features_train)
C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
500 self._update_transformer_list(transformers)
501 if any(sparse.issparse(f) for f in Xs):
--> 502 Xs = sparse.hstack(Xs).tocsr()
503 else:
504 Xs = np.hstack(Xs)
C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
579 else:
580 if brow_lengths[i] != A.shape[0]:
--> 581 raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
582
583 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions
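
One possible reading of the traceback, offered as a hedged sketch rather than a confirmed diagnosis: iterating over a pandas DataFrame yields its column labels, so a CountVectorizer fed a three-column frame would return three rows instead of one row per order line, and the union could then not stack that block next to the scaled numeric block. The toy frame below is hypothetical (only the column names come from the code above). ColumnTransformer is a different tool from the FeatureUnion used above and assumes scikit-learn 0.20 or newer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data; only the column names are taken from the question.
toy = pd.DataFrame({
    'Part Description': ['Duikwerkzaamheden', 'Inspectie boegschroef',
                         'Spare parts', 'Oil filter'],
    'Quantity': [1.0, 1.0, 2.0, 10.0],
    'Price/Base': [750.00, 0.01, 3809.16, 12.50],
    'Supplier Name': ['Duik Bergingsbedrijf', 'Duik Bergingsbedrijf',
                      'Lastechniek Europa', 'Exalto'],
    'Purch UoM': ['pcs', 'pcs', 'set', 'pcs'],
})
text_columns = ['Part Description', 'Supplier Name', 'Purch UoM']

# Iterating over a DataFrame yields its column labels, so the vectorizer
# treats the three labels as three documents, not one document per row.
label_counts = CountVectorizer().fit_transform(toy[text_columns])
print(label_counts.shape[0], 'rows from the text branch;', len(toy), 'order lines')

# Assumed alternative: one vectorizer per text column. A column given as a
# plain string (not a list) is passed to the transformer as a 1-D Series,
# which is the kind of input CountVectorizer expects.
combined = ColumnTransformer([
    ('numbers', RobustScaler(), ['Price/Base', 'Quantity']),
    ('description', CountVectorizer(), 'Part Description'),
    ('supplier', CountVectorizer(), 'Supplier Name'),
    ('uom', CountVectorizer(), 'Purch UoM'),
])
features = combined.fit_transform(toy)
print(features.shape[0], 'rows after combining all columns')
```

If this reading is right, the combined transformer could take the place of preprocessor and be chained with a classifier, for example make_pipeline(combined, LinearSVC()), and then fitted on features_train and target_train.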