It's always better to include a fully working example in your question; it can and should be minimal. As @anastasiya-Romanova pointed out, you have to pass the steps to the Pipeline constructor correctly, which is also shown here.
from sklearn.datasets import make_blobs
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import pandas as pd
# Generate synthetic data + make a pseudo-categorical column with qcut
X, y = make_blobs(n_samples=1000, centers=2, random_state=42)
X = pd.DataFrame(X)
X.columns = ["feat1", "feat2"]
X["feat2"] = pd.qcut(X["feat2"], 3, labels=False, duplicates="drop")
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the pipeline: scale the numeric column, one-hot encode the categorical one
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('scaler', StandardScaler(), ['feat1']),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['feat2'])
    ])),
    ('classifier', GaussianNB())
])
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
# Evaluate the model on the test data
accuracy = pipeline.score(X_test, y_test)
print('Test accuracy:', accuracy)
# Show what the preprocessor (fitted on the training data) does to the features
X_transformed = pd.DataFrame(pipeline.named_steps['preprocessor'].transform(X))
print(X_transformed.head())
This prints:
0 1 2 3
0 -0.757494 0.0 1.0 0.0
1 1.396373 1.0 0.0 0.0
2 0.648693 1.0 0.0 0.0
3 1.085098 1.0 0.0 0.0
4 0.895531 0.0 1.0 0.0
Test accuracy: 1.0
For completeness, here is the example from the linked scikit-learn documentation, which uses a Pipeline in the same way:
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
... random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88
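Because the pipeline behaves like any other estimator, the same leakage-free behavior carries over to cross-validation: each fold refits the scaler on its own training split only. A short sketch of this, reusing the estimators from the documentation example:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# cross_val_score clones and refits the whole pipeline per fold,
# so the scaler never sees the fold's validation data during fitting
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the bare SVC with pre-scaled data instead would leak information from the validation folds into the scaling step.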