I can't get the code below to fit. It uses an imblearn pipeline:

features = training_data.loc[:, training_data.columns[:-1]]
labels = training_data.loc[:, training_data.columns[-1:]]


X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
print(X_train.shape, y_train.shape)

The print output for the shapes is: (80, 6) (80, 1)

algorithms = [
    svm.LinearSVC(),
    ensemble.RandomForestClassifier(),
]

def train(algorithm, X_train, y_train):
    model = Pipeline([
        ('vect', TfidfVectorizer()),
        ('smote', SMOTETomek()),
        ('chi', SelectKBest(chi2, k=1000)),
        ('classifier', algorithm)
    ])
    model.fit(X_train, y_train)
    return model

score_dict = {}
algorithm_to_model_dict = {}
for algorithm in algorithms:
    model = train(algorithm, X_train, y_train)
    score = model.score(X_test, y_test)
    score_dict[algorithm] = int(score * 100)
    algorithm_to_model_dict[algorithm] = model

The features and labels are all text (I'm doing text analysis).

The exception is being raised from the fit call

What am I doing wrong?

Ben

1 Answer

This is happening because you have a text transformer object in your pipeline. The problem with this approach is that the pipeline will pass the whole dataframe to the TfidfVectorizer. However, the text transformers of scikit-learn expect a 1d input.

Iterating over a dataframe yields its column names, so when a 2d dataframe is passed to TfidfVectorizer, it mistakes the column names for the documents. You can check this with a simple example:

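import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({
    'f1': ['This is doc1', 'This is doc2',
           'This is doc3', 'This is doc4', 'This is doc5'],
    'f2': [0, 1, 1, 0, 0]
})

vec = TfidfVectorizer()
# scikit-learn >= 1.0; older versions use get_feature_names() instead
print(list(vec.fit(X).get_feature_names_out()))

>>> ['f1', 'f2']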

This explains why the error message reports an inconsistent number of samples: TfidfVectorizer treated the names of the 6 columns in your dataframe as the documents, so its output has 6 rows, which does not match the number of labels.
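The row-count mismatch can be checked directly. This is a minimal sketch that reuses the toy dataframe from the example above; the names `from_frame` and `from_column` are illustrative only and do not come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({
    'f1': ['This is doc1', 'This is doc2',
           'This is doc3', 'This is doc4', 'This is doc5'],
    'f2': [0, 1, 1, 0, 0],
})

# Iterating a DataFrame yields its column names, so the vectorizer
# sees one "document" per column rather than one per row.
from_frame = TfidfVectorizer().fit_transform(X)

# Passing the single text column gives one row per document.
from_column = TfidfVectorizer().fit_transform(X['f1'])

print(from_frame.shape[0], from_column.shape[0])
```

The first number is the number of columns and the second is the number of rows. Only the second can line up with the labels.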

If you want to use TfidfVectorizer in your pipeline, you have to make sure that only the column with the text documents is passed to it. You can achieve this by wrapping it in a ColumnTransformer:

# if only one column needs to be transformed
transformer = ColumnTransformer(
    [('vec', TfidfVectorizer(), column)],   # column should be a string or int
    remainder='passthrough'
)

# if more than one column needs to be transformed (discouraged, see Note below)
# each step needs a unique name, otherwise ColumnTransformer raises an error
transformer = ColumnTransformer(
    [('vec_1', TfidfVectorizer(), col_1),  # col_1 should be a string or int
     ...
     ('vec_n', TfidfVectorizer(), col_n)],   # col_n should be a string or int
    remainder='passthrough'
)

Replace column above with the index or the name of the column the TfidfVectorizer has to transform and it will only process this particular column. The remainder='passthrough' will make sure the other columns are left as is and are concatenated with the result. You can then use it in your pipeline like this:

model = Pipeline([
    ('vect', transformer),
    ('smote', SMOTETomek()),
    ('chi', SelectKBest(chi2, k=1000)),
    ('classifier', algorithm)
])
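For reference, here is a small end-to-end sketch of the single-column variant. It rests on several assumptions: the toy dataframe, the column names `text` and `length`, and `k=2` are made up for illustration, and the `SMOTETomek` step is left out so that the snippet only depends on pandas and scikit-learn. With a sampler in the pipeline you would need the `Pipeline` class from `imblearn.pipeline` instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: one text column and one numeric column.
X = pd.DataFrame({
    'text': ['good movie', 'bad movie', 'good plot',
             'bad plot', 'good acting', 'bad acting'],
    'length': [2, 2, 2, 2, 2, 2],
})
y = [1, 0, 1, 0, 1, 0]  # labels as a 1d sequence

# The column is given as a single string, so TfidfVectorizer
# receives a 1d series of documents.
transformer = ColumnTransformer(
    [('vec', TfidfVectorizer(), 'text')],
    remainder='passthrough',
)

model = Pipeline([
    ('vect', transformer),
    ('chi', SelectKBest(chi2, k=2)),  # small k because the toy vocabulary is tiny
    ('classifier', LinearSVC()),
])
model.fit(X, y)

# The transformer now produces one row per sample.
n_rows = model.named_steps['vect'].transform(X).shape[0]
predictions = model.predict(X)
print(n_rows, len(predictions))
```

Note that `chi2` requires non-negative feature values, so any columns passed through unchanged have to satisfy that as well.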

Note

If you have to transform several text columns, consider merging the column entries into a single, combined document and transforming only that. Otherwise, each column gets its own vocabulary, even though the vocabularies probably overlap to some extent, and you might end up with a very large number of features.
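One way to build such a combined document is sketched below. The dataframe and the column names `title`, `body` and `combined` are hypothetical. Missing values are replaced with empty strings before joining, which is an assumption about how you want to treat them.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical frame with two text columns, one value missing.
df = pd.DataFrame({
    'title': ['Cheap flights', 'Hotel deals', None],
    'body': ['Book cheap flights today', 'Best hotel deals', 'No title here'],
})

text_cols = ['title', 'body']

# Replace missing values with empty strings, join the columns row by row,
# and trim the whitespace left over from empty entries.
df['combined'] = df[text_cols].fillna('').agg(' '.join, axis=1).str.strip()

# A single vectorizer then learns one shared vocabulary.
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['combined'])
print(tfidf.shape)
```

The `combined` column can then be passed to the `ColumnTransformer` as the single column to vectorize.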

afsharov
  • I've tried your suggestion as follows: `transformer = ColumnTransformer([('vec', TfidfVectorizer(), list(features.columns))], remainder='passthrough')` but I'm still getting similar error: `Found input variables with inconsistent numbers of samples: [6, 58]` – Ben May 31 '21 at 09:57
  • When you pass a list of column names/indices to `ColumnTransformer`, it will again pass a 2d dataframe to `TfidfVectorizer`. If you really want to vectorize each column separately, you will have to add one tuple for each column in the transformer list. Will edit the answer to show this. – afsharov May 31 '21 at 10:14
  • However, vectorizing each column separately is not a good option because you might end up with a really large dimensionality, although the vocabulary of each column will probably overlap to some extent. It would probably be better that you combine all records (column values with text) into a single record in a single column. – afsharov May 31 '21 at 10:16
  • I've already tried single column for all text columns - the score was bad – Ben May 31 '21 at 10:22
  • After using the discouraged approach, I'm getting `ValueError: empty vocabulary; perhaps the documents only contain stop words` – Ben May 31 '21 at 10:26
  • This `ValueError` is a completely different problem and relates to the data itself, and not the transformers or the way they are set up. This happens for example if your documents consist of a simple word/character like `a` and not a sentence. You can also refer to [here](https://stackoverflow.com/questions/20928769/python-tfidfvectorizer-throwing-empty-vocabulary-perhaps-the-documents-only-c) – afsharov May 31 '21 at 10:32
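The empty-vocabulary error from the last comments can be reproduced with a short sketch. The documents here are invented, and the custom `token_pattern` is only one possible workaround, not necessarily the right one for the asker's data. The default token pattern of `TfidfVectorizer` only keeps tokens of two or more word characters, so single-character documents leave nothing behind.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['a', 'b', 'a b']  # hypothetical single-character documents

# With the default settings, every token is dropped and fit() fails.
try:
    TfidfVectorizer().fit(docs)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Possible workaround: a token pattern that also accepts single characters.
vec = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
vec.fit(docs)
vocab = sorted(vec.vocabulary_)
print(vocab)
```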