I am trying to run a grid search over an sklearn pipeline that includes a custom transformer inside a FeatureUnion. It works fine when the FeatureUnion uses the custom transformer class; however, it fails when the custom class is skipped by setting passthrough in the grid search parameters.
The full pipeline is defined as follows:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

ngram_vectorizer = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer())
])

pipe_full = Pipeline(
    [
        ("features", FeatureUnion(
            [
                ("ngrams", ngram_vectorizer),
                ("lengths", TextLengthExtractor())
            ]
        )),
        ("classifier", MultinomialNB())
    ]
)
The custom transformer class TextLengthExtractor simply computes the number of characters in each input string:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X, y=None):
        # One row per document, one column holding its character count.
        string_lengths = np.array([len(doc) for doc in X])
        return string_lengths.reshape(-1, 1)
The tuning parameters for the grid search are defined in a dictionary params. Importantly, the candidates for the features__lengths step include the passthrough option, which is meant to skip the entire step in the pipeline (see also the sklearn documentation on pipelines):
params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
When the pipeline is fit on the following dummy data
X_train_dummy = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa", "c", "bca"]
y_train_dummy = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
pipe_full.fit(X_train_dummy, y_train_dummy)
it can be seen that the lengths step of the FeatureUnion works as expected:
pipe_full["features"].get_params()["lengths"].transform(X_train_dummy)
# gives the following output of shape (10,1)
# array([[1], [2], [4], [5], [9], [2], [4], [8], [1], [3]])
However - and now comes the problem - when grid search is performed as follows:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)
all fits that skip the lengths step (as selected by the passthrough option in params["features__lengths"]) throw the following error:
5 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 378, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\dev\NameClassification\venv\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1162, in fit_transform
return self._hstack(Xs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1216, in _hstack
Xs = sparse.hstack(Xs).tocsr()
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 532, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 665, in bmat
raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 8.
I do understand that the ngrams and lengths steps in the FeatureUnion must produce feature matrices with identical row dimensions, where the number of rows must equal the number of samples in the respective split. However, I have no idea how to control the shape of the matrices when skipping the lengths part of the FeatureUnion using the passthrough option in the grid search params.
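To make the row-count requirement concrete, here is a minimal NumPy-only sketch. It does not use sklearn's actual stacking code: ngram_block is a made-up stand-in for the ngrams output, and it is only an assumption that passthrough leads to something comparable to the flat, single-row case inside the FeatureUnion.

```python
import numpy as np

# Hypothetical stand-in for one training split of 8 documents.
docs = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa"]
lengths = np.array([len(doc) for doc in docs])

# Made-up stand-in for the ngrams block: 8 rows, 3 feature columns.
ngram_block = np.zeros((len(docs), 3))

# What TextLengthExtractor.transform returns: one row per document.
column = lengths.reshape(-1, 1)
print(column.shape)   # (8, 1)
stacked = np.hstack([ngram_block, column])
print(stacked.shape)  # (8, 4)

# A flat sequence promoted to 2-D becomes a single row instead.
row = np.atleast_2d(lengths)
print(row.shape)      # (1, 8)

# Stacking side by side now fails because the row counts differ (8 vs. 1).
try:
    np.hstack([ngram_block, row])
    stack_failed = False
except ValueError as exc:
    stack_failed = True
    print("stacking failed:", exc)
```

The (1, 8) versus (8, 1) shapes mirror the "Got blocks[0,1].shape[0] == 1, expected 8" part of the error message, which is why the row dimension looks like the culprit.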
I have not found any solution to the problem on SE or in any other sklearn-related resource. Does anyone have an idea how to solve the issue?