I'm having difficulty fitting a pipeline on a dataset for K-means clustering with scikit-learn. For illustration, let's say I have a DataFrame like the following (only much larger):
df = pd.DataFrame({'Text': ['the traveler', 'alien', 'titanic', 'real leather', 'prestige'], 'Numeric': [22, 41, 2, 78, 3]})
I have written two FunctionTransformers to extract the text and the numeric features from the DataFrame:
get_text_data = FunctionTransformer(lambda x: x['Text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x['Numeric'], validate=False)
After that, I wrote the following pipeline using sklearn.pipeline:
pl = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('numeric_features', Pipeline([('selector', get_numeric_data), ('imputer', Imputer())])),
        ('text_features', Pipeline([('selector', get_text_data), ('vectorizer', TfidfVectorizer(token_pattern=TOKENS_ALPHANUMERIC))])),
    ])),
    ('kmeans', KMeans(n_clusters=4)),
])
I then tried to fit the pipeline like this:
pl.fit(df)
And I get the following error:
ValueError: blocks[0,:] has incompatible row dimensions
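The message suggests that the blocks being stacked side by side do not agree on the number of rows, so one thing that may be worth checking is the shape each selector returns. The snippet below is only a debugging sketch under that assumption, not a confirmed diagnosis. It rebuilds the sample DataFrame and compares the single-bracket selection used in get_numeric_data with a double-bracket variant; the names numeric_1d and numeric_2d are made up for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'Text': ['the traveler', 'alien', 'titanic', 'real leather', 'prestige'],
    'Numeric': [22, 41, 2, 78, 3],
})

# Same selection as in get_numeric_data: single brackets return a 1-D Series.
numeric_1d = df['Numeric']

# Hypothetical variant: double brackets return a 2-D DataFrame with one column.
numeric_2d = df[['Numeric']]

print(numeric_1d.shape)  # (5,)
print(numeric_2d.shape)  # (5, 1)
```

If the imputer treats the 1-D input as a single row rather than five rows, the numeric branch would no longer line up with the five rows coming out of the text branch, which might explain the mismatch.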
I'm a bit lost and would appreciate any help with this. Thanks.