
I am trying to train a meta classifier on different features from a pandas dataframe.

The features are either text or categorical in nature.

I am having issues fitting the model; it fails with the error 'Found input variables with inconsistent numbers of samples: [1, 48678]'. I understand what the error means, but not how to fix it. Help much appreciated!

The code I am using is as follows:

import pandas as pd
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# set target label
target_label = ['target']
features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5', 'text_1']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cleansed_data[features], cleansed_data[target_label], test_size=0.2, random_state=0)

text_features = ['text_1']
categorical_features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']

# encoder
le = preprocessing.LabelEncoder()

# vectoriser
vectoriser = TfidfVectorizer()

# classifiers
mlp_clf = MLPClassifier()
rf_clf = RandomForestClassifier()

from sklearn.base import TransformerMixin, BaseEstimator
class SelectColumnsTransfomer(BaseEstimator, TransformerMixin):

    def __init__(self, columns=[]):
        self.columns = columns

    def transform(self, X, **transform_params):
        trans = X[self.columns].copy()
        return trans

    def fit(self, X, y=None, **fit_params):
        return self

# text pipeline
text_steps = [('feature extractor', SelectColumnsTransfomer(text_features)),
              ('tf-idf', vectoriser),
              ('classifier', mlp_clf)]

# categorical pipeline
categorical_steps = [('feature extractor', SelectColumnsTransfomer(categorical_features)),
                     ('label encode', le),
                     ('classifier', rf_clf)]

pl_text = Pipeline(text_steps)
pl_categorical = Pipeline(categorical_steps)

pl_text.fit(X_train, y_train)

from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression

sclf = StackingCVClassifier(classifiers=[pl_text, pl_categorical],
                            use_probas=True,
                            meta_classifier=LogisticRegression())

EDIT: Here is some code that recreates the issue; it raises 'ValueError: Found input variables with inconsistent numbers of samples: [1, 3]'.

d = {'cat_1': ['A', 'A', 'B'], 'cat_2': [1, 2, 3],
     'cat_3': ['AA', 'DD', 'PP'], 'cat_4': ['X', 'B', 'V'],
     'cat_5': ['G', 'H', 'I'],
     'text_1': ['the cat sat on the mat', 'the mat sat on the cat', 'sat on the cat mat']}
features = pd.DataFrame(data=d)

t = [0, 1, 0]
target = pd.DataFrame(data=t)

text_features = ['text_1']
categorical_features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']

# text pipeline
text_steps = [('feature extractor', SelectColumnsTransfomer(text_features)),
              ('tf-idf', vectoriser),
              ('classifier', mlp_clf)]

# categorical pipeline
categorical_steps = [('feature extractor', SelectColumnsTransfomer(categorical_features)),
                     ('label encode', le),
                     ('classifier', rf_clf)]

pl_text = Pipeline(text_steps)
pl_categorical = Pipeline(categorical_steps)

pl_text.fit(features, target)

from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression

sclf = StackingCVClassifier(classifiers=[pl_text, pl_categorical],
                            use_probas=True,
                            meta_classifier=LogisticRegression())

sclf.fit(features, target)
    Without sample data, no one will be able to help you here; plus, should *we* guess where exactly in this whole bunch of code does the error happen? Please see [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), and edit your question accordingly – desertnaut Nov 18 '17 at 15:56
    Thanks desertnaut, I will add some sample data shortly. – shbfy Nov 18 '17 at 16:36

1 Answer


Ok, I managed to get it to work by replacing text_features = ['text_1'] with text_features = 'text_1'.

Basically, when you pass ['text_1'] to the SelectColumnsTransfomer class it returns a DataFrame, and iterating over a DataFrame yields its column names rather than its rows, so the tf-idf vectoriser sees only a single "document". When the pipeline calls fit_transform, the vectoriser therefore produces a single row, and that one sample cannot be matched against three target values.

If you pass in 'text_1' instead, the selector returns a Series, and the vectoriser correctly sees your three strings as three documents. Your text pipeline will work now.
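
A minimal sketch of the difference, assuming the small example data from the edit (the vocabulary size will vary, but the one-row vs. three-row behaviour is the point):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'text_1': ['the cat sat on the mat',
                              'the mat sat on the cat',
                              'sat on the cat mat']})
vec = TfidfVectorizer()

# DataFrame: iterating yields the column name 'text_1', so the vectoriser
# sees a single "document" and produces a single row
print(vec.fit_transform(df[['text_1']]).shape)  # 1 sample vs 3 targets -> error

# Series: iterating yields the three strings, one row per sample
print(vec.fit_transform(df['text_1']).shape)    # 3 samples, matches 3 targets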

  • Thank you, really appreciated. This has fixed the issue on the example but I'm now getting another error on the actual data... Not sure how to recreate this one.. 'DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)' – shbfy Nov 18 '17 at 21:55
  • hmmm, is there any way for me to recreate the error? – amanbirs Nov 19 '17 at 07:20
  • actually, I think you could fix this just by passing y[0] instead of y. Hit me up in chat if you want more help. – amanbirs Nov 20 '17 at 07:35
  • Apologies for belated response, I have been struggling to recreate the issue with a simple example for you. I'll give that a go and let you know if it works. Thanks for your help - genuinely appreciated. – shbfy Nov 20 '17 at 13:04
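
For the DataConversionWarning discussed in the comments, a minimal sketch of the y-shape fix, assuming the example data from the edit (this only addresses the warning, not the rest of the stacking setup):

# target is a one-column DataFrame, so fit(features, target) passes a 2-d y
# of shape (3, 1); scikit-learn estimators expect a 1-d y of shape (3,)
y = target[0]                 # the column as a Series, as suggested above
# or equivalently:
y = target.values.ravel()     # flatten the (3, 1) array to 1-d

pl_text.fit(features, y)      # fits without the column-vector warning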