RuntimeError: A pipeline has not yet been optimized. Please call fit() first. Problem with TPOT Automated Machine Learning in Python

Question

When executing sample code, I encounter the following error: "RuntimeError: A pipeline has not yet been optimized. Please call fit() first."

I am trying to reproduce the example "Dataset 2: Mushroom Classification" from the article "TPOT Automated Machine Learning in Python" (https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9).

source code: https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/

I tried moving the tpot.fit(X_train, y_train) call, but that doesn't solve the problem.

Library

import time
import gc
import pandas as pd
import numpy as np
import seaborn as sns
import timeit
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5, palette="colorblind")
import category_encoders

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier

# Read data
df_cogumelo = pd.read_csv('agaricus-lepiota.csv')  

# Visualization
pd.options.display.max_columns = 200
pd.options.display.width = 200

# separate out X
X = df_cogumelo.reindex(columns=[x for x in df_cogumelo.columns.values if x != 'class']) 

X = X.apply(LabelEncoder().fit_transform)

# separate out y
y = df_cogumelo.reindex(columns=['class'])   
print(y['class'].value_counts())
y = np.ravel(y)                     # flatten the y array
y = LabelEncoder().fit_transform(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=10) 


print(X_train.describe())
print("\n\n\n")
print(X_train.info())

# generations and population_size together determine how many pipelines are evaluated.

tpot = TPOTClassifier(verbosity=3, 
                  scoring="accuracy", 
                  random_state=10, 
                  periodic_checkpoint_folder="tpot_mushroom_results", 
                  n_jobs=-1, 
                  generations=2, 
                  population_size=10, use_dask=True) #use_dask=True

times = []
scores = []
winning_pipes = []

# run several fits 
for x in range(10):
    start_time = timeit.default_timer()

    # optimize a pipeline on the training data
    tpot.fit(X_train, y_train)

    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)

    winning_pipes.append(tpot.fitted_pipeline_)

    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mushroom.py')


# output results
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)   
print('Winning pipelines:', winning_pipes)

# The expected result is as follows:
# https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/

score 0 · Answer 1 · answered Sep 07 '20 at 10:59


Removing "use_dask=True" solved the error for me.


Muhammad Husnain


And if you want to use dask, you can install it via pip install dask==2.18.1. – Muhammad Husnain Sep 08 '20 at 06:21

score -1 · Answer 2 · answered Mar 17 '20 at 16:50

Your problem is not the code; it is your data. That mushroom dataset has no header row. Open the file, insert a new first row, and label the columns (the names don't matter), making sure the last column is named 'class' (lowercase c). That should fix the problem. If you look at your output, printing the y['class'] count gives None. If you have already added the labels correctly, then please post the full stack trace.
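As an alternative to editing the file by hand, the column names can probably be supplied at read time. The sketch below uses a tiny in-memory sample rather than the real dataset; the three-column layout and the names cap_shape and cap_surface are made up for illustration, and placing 'class' last simply follows the description above, so check which column holds the label in your copy of the file. It also shows why the question's code does not fail earlier: reindex(columns=['class']) on a frame without that column silently returns an all-NaN column instead of raising.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for a headerless CSV file.
raw = "x,s,p\nb,y,e\nx,y,p\n"

# Default read: the first data row is consumed as the header,
# so there is no 'class' column and one row of data is lost.
wrong = pd.read_csv(io.StringIO(raw))

# reindex does not raise on a missing column; it creates an all-NaN one.
y_wrong = wrong.reindex(columns=["class"])

# header=None with names= keeps every row as data and labels the columns.
cols = ["cap_shape", "cap_surface", "class"]  # illustrative names only
right = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(wrong.columns.tolist())
print(y_wrong["class"].isna().all())
print(right["class"].value_counts())
```

If the first print shows data values instead of column names, the file most likely has no header row, and the label column passed to TPOT would contain only missing values.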
