Error with fitting a pipeline for clustering both text and numeric data

Question

I'm having trouble fitting a pipeline on a data set for K-means clustering with scikit-learn. For illustration, say I have a DataFrame like the following (only much larger):

import pandas as pd

df = pd.DataFrame({
    'Text': ['the traveler', 'alien', 'titanic', 'real leather', 'prestige'],
    'Numeric': [22, 41, 2, 78, 3]
})

I have written two FunctionTransformers to select the text and numeric features from the DataFrame:

get_text_data = FunctionTransformer(lambda x: x['Text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x['Numeric'], validate=False)

After that I wrote the following pipeline using sklearn.pipeline:

pl = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', TfidfVectorizer(token_pattern=TOKENS_ALPHANUMERIC))
        ]))
    ])),
    ('kmeans', KMeans(n_clusters=4))
])

I then tried to fit the pipeline like this:

pl.fit(df)

And I get the following error:

ValueError: blocks[0,:] has incompatible row dimensions

I'm a bit lost and would appreciate any help with this. Thanks.

Can you post the full error message? Is it the `FeatureUnion`, `Pipeline`, `Imputer`, `TfidfVectorizer` or `KMeans` that's producing that error? Also, does `sklearn` accept `pandas` data frames as input? I thought it only takes numpy arrays and scipy sparse matrices. — Matti Lyra, May 07 '17 at 10:28
Could you add line breaks to your code? It's hard to read right now. — Imanuel, May 07 '17 at 10:39
Well, I've tried techniques other than k-means, including classifiers (although that makes no sense in this context), and I keep getting the same error. I have also tried to fit the pipeline this way: `pl.fit(df.values)`, but then I get another error: IndexError: only integers, slices (':'), ellipsis ('...'), numpy.newaxis ('None') and integer or boolean arrays are valid indices — Uri T, May 07 '17 at 10:51
@MattiLyra Yes, it does take a Pandas DataFrame. It will convert it internally to an appropriate type. To the OP: unless you post data that reproduces this behaviour and the complete stack trace of the error, we are unable to help. — Vivek Kumar, May 07 '17 at 14:39
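
A likely cause, though not confirmed anywhere in this thread: `x['Numeric']` returns a one-dimensional Series, which the imputer may treat as a single row of five values, so the numeric block and the TF-IDF block end up with different row counts when `FeatureUnion` stacks them. The sketch below selects the numeric column with double brackets to keep it two-dimensional. It rests on two assumptions that are not from the question: a scikit-learn version that provides `SimpleImputer` (used here in place of `Imputer`), and the vectorizer's default token pattern, since `TOKENS_ALPHANUMERIC` is not defined in the post. The `n_init` and `random_state` arguments are added only to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    'Text': ['the traveler', 'alien', 'titanic', 'real leather', 'prestige'],
    'Numeric': [22, 41, 2, 78, 3],
})

# Double brackets return a DataFrame of shape (5, 1) instead of a Series of shape (5,).
get_numeric_data = FunctionTransformer(lambda x: x[['Numeric']], validate=False)
# The vectorizer expects a one-dimensional sequence of strings, so a Series is fine here.
get_text_data = FunctionTransformer(lambda x: x['Text'], validate=False)

union = FeatureUnion(transformer_list=[
    ('numeric_features', Pipeline([
        ('selector', get_numeric_data),
        ('imputer', SimpleImputer()),  # assumption: stands in for the question's Imputer
    ])),
    ('text_features', Pipeline([
        ('selector', get_text_data),
        ('vectorizer', TfidfVectorizer()),  # assumption: default token pattern
    ])),
])

pl = Pipeline([
    ('union', union),
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=0)),
])
pl.fit(df)

# Both blocks now have five rows, so they can be stacked side by side.
features = pl.named_steps['union'].transform(df)
labels = pl.named_steps['kmeans'].labels_
print(features.shape)
print(labels)
```

If this is indeed the problem, note that the raw numeric column is on a much larger scale than the TF-IDF values and would probably dominate the K-means distances, so scaling it first may be worth considering.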
