This happens because your pipeline contains a text transformer. The pipeline passes the whole dataframe to the TfidfVectorizer, but the text transformers of scikit-learn expect a 1d input. When it is given a 2d dataframe, TfidfVectorizer iterates over it, which yields the column names, and mistakes those names for the documents. You can check this with a simple example:
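import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({
    'f1': ['This is doc1', 'This is doc2',
           'This is doc3', 'This is doc4', 'This is doc5'],
    'f2': [0, 1, 1, 0, 0]
})
vec = TfidfVectorizer()
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
print(vec.fit(X).get_feature_names_out())
# prints: ['f1' 'f2']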
This explains why the error message reports an inconsistent number of samples: the TfidfVectorizer treated the 6 columns in the dataframe as the samples and built the features from their names.
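For contrast, here is a minimal sketch that reuses the toy dataframe from the example above and compares the output shapes for the 1d column and the 2d dataframe. The variable names (`tfidf_1d`, `shape_2d`, and so on) are only illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({
    'f1': ['This is doc1', 'This is doc2',
           'This is doc3', 'This is doc4', 'This is doc5'],
    'f2': [0, 1, 1, 0, 0],
})

# 1d input: every row of the text column is one document.
tfidf_1d = TfidfVectorizer().fit_transform(X['f1'])
shape_1d = tfidf_1d.shape  # (5, 7): 5 documents, 7 distinct tokens

# 2d input: iterating the dataframe yields only the column names.
tfidf_2d = TfidfVectorizer().fit_transform(X)
shape_2d = tfidf_2d.shape  # (2, 2): 2 "documents" ('f1', 'f2'), 2 tokens

print(shape_1d, shape_2d)
```

The row count of the 2d result equals the number of columns, not the number of samples, which is what the later pipeline steps trip over.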
If you want to use TfidfVectorizer in your pipeline, you have to make sure that only the column with the text documents is passed to it. You can achieve this by wrapping it in a ColumnTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# if only one column needs to be transformed
transformer = ColumnTransformer(
    [('vec', TfidfVectorizer(), column)],  # column should be a string or int
    remainder='passthrough'
)

# if more than one column needs to be transformed (discouraged, see Note below)
# each step needs its own unique name
transformer = ColumnTransformer(
    [('vec_1', TfidfVectorizer(), col_1),  # col_1 should be a string or int
     ...
     ('vec_n', TfidfVectorizer(), col_n)],  # col_n should be a string or int
    remainder='passthrough'
)
Replace column above with the index or the name of the column the TfidfVectorizer has to transform, and it will only process this particular column. Pass it as a single string or int rather than a list, so that the vectorizer receives a 1d input. Setting remainder='passthrough' makes sure the other columns are left as is and are concatenated with the result. You can then use it in your pipeline like this:
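# samplers such as SMOTETomek only work in imblearn's Pipeline, not scikit-learn's
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.feature_selection import SelectKBest, chi2

model = Pipeline([
    ('vect', transformer),
    ('smote', SMOTETomek()),
    ('chi', SelectKBest(chi2, k=1000)),
    ('classifier', algorithm)
])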
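Before plugging the transformer into the full pipeline, it can help to check what it produces on its own. The following is a small sketch on an assumed toy dataframe (the columns `f1` and `f2` are made up for illustration, not taken from your data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: one text column and one numeric column (illustrative only).
X = pd.DataFrame({
    'f1': ['This is doc1', 'This is doc2',
           'This is doc3', 'This is doc4', 'This is doc5'],
    'f2': [0, 1, 1, 0, 0],
})

transformer = ColumnTransformer(
    [('vec', TfidfVectorizer(), 'f1')],  # scalar column name -> 1d input
    remainder='passthrough',             # keep 'f2' untouched
)

Xt = transformer.fit_transform(X)
# 5 rows are preserved; 7 tf-idf columns plus 1 passthrough column.
print(Xt.shape)
```

The number of rows now matches the number of samples, so the following steps see consistent input.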
Note
If you have to transform several columns with text, you should consider merging the column entries into a single, combined document and only transforming this combined document. Otherwise, each column gets its own vocabulary, even though the vocabularies may overlap to some extent, and you might end up with a very high dimensionality, i.e. a lot of features.
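One way to do the merging described in the note is to join the text columns row-wise with pandas before the data reaches the transformer. This is only a sketch: the dataframe and the column names `title`, `body`, `num` and `text` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical dataframe with two text columns and one numeric column.
df = pd.DataFrame({
    'title': ['cheap flights', 'new phone', 'cheap phone'],
    'body': ['book flights now', 'phone review inside',
             'flights and phone deals'],
    'num': [1, 0, 1],
})

text_cols = ['title', 'body']
# Join the text columns into one combined document per row.
df['text'] = df[text_cols].fillna('').agg(' '.join, axis=1)
df = df.drop(columns=text_cols)

transformer = ColumnTransformer(
    [('vec', TfidfVectorizer(), 'text')],  # single shared vocabulary
    remainder='passthrough',
)
Xt = transformer.fit_transform(df)

# The fitted vectorizer holds one vocabulary for all merged text.
vocab = sorted(transformer.named_transformers_['vec'].vocabulary_)
print(Xt.shape, vocab)
```

Words that occur in both original columns, such as "flights" or "phone" here, end up as a single feature instead of one per column.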