I'm trying to use ClassifierChain for a multilabel classification problem, following this tutorial:

https://code.i-harness.com/en/docs/scikit_learn/auto_examples/multioutput/plot_classifier_chain_yeast

from pmlb import fetch_data
from sklearn.multioutput import ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns

# load dataset and descriptive statistics
dataset_Name = 'yeast'; dataset = fetch_data(dataset_Name)

print();  print(dataset.head())    
print();  print(dataset.columns)

cols = ['mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc']

print();  print(dataset[cols].info())    
print();  print(dataset[cols].describe())
print();  print(dataset[cols].corr())    

# load features and target from dataset
X, y = fetch_data(dataset_Name, return_X_y=True)

# Split Train and Test Datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

chains = [ClassifierChain(LogisticRegression(), order=[1, 0, 2, 4, 3, 5, 6, 7, 8], random_state=i)
          for i in range(9)]
print(chains)
for chain in chains:
    chain.fit(X_train, y_train)

I'm getting the error 'tuple index out of range'. Can anyone explain what causes it? I can't work it out myself. The full traceback is below:

IndexError  Traceback (most recent call last)
<ipython-input-41-d020752b05d2> in <module>
  5 print(chains)
  6 for chain in chains:
   ----> 7     chain.fit(X_train, Y_train)
  ~\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\multioutput.py      fit  (self, X, Y)
  465             if self.order_ == 'random':
  466                 self.order_ = random_state.permutation(Y.shape[1])
  --> 467         elif sorted(self.order_) != list(range(Y.shape[1])):
  468                 raise ValueError("invalid order")
  469 

IndexError: tuple index out of range

Aizayousaf

1 Answer


This happens because your y_train variable is a 1D array: its shape tuple has only one entry, so the Y.shape[1] lookup inside fit() raises the IndexError. To fix that, make it a 2D array using the reshape() method, so that its shape is (1183, 1) instead of (1183,).
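A minimal sketch of why the message mentions a tuple, assuming only that .shape on a NumPy array is an ordinary Python tuple. The helper n_label_columns is hypothetical and just mirrors the Y.shape[1] lookup visible in the traceback; plain tuples stand in for the arrays.

```python
# .shape on a NumPy array is a plain Python tuple, so tuples stand in for arrays here.
shape_1d = (1183,)    # shape of a 1D target, like y_train before reshaping
shape_2d = (1183, 1)  # shape after y_train.reshape(-1, 1)


def n_label_columns(shape):
    """Hypothetical helper mirroring the Y.shape[1] lookup seen in the traceback."""
    return shape[1]


try:
    n_label_columns(shape_1d)
    error_message = None
except IndexError as exc:
    # A one-element tuple has no index 1, hence the error reported in the question.
    error_message = str(exc)

print(error_message)
print(n_label_columns(shape_2d))
```

With the 2D shape the lookup succeeds and reports a single label column.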

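Also, you will need to change the order argument. According to the documentation, it must be a permutation of the integers from 0 to y_train.shape[1]-1, and with a single label column that upper bound is 0. The nine-element list from your code therefore fails the "invalid order" check shown in the traceback. So, use order=[0], or simply "random" instead.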
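To see which order lists pass, here is a small sketch that repeats the comparison from the traceback line `sorted(self.order_) != list(range(Y.shape[1]))` outside scikit-learn. The function is_valid_order is a hypothetical name used for illustration, not part of the library.

```python
def is_valid_order(order, n_labels):
    """Hypothetical helper: the same comparison as the check shown in the traceback."""
    return sorted(order) == list(range(n_labels))


# The order list passed to ClassifierChain in the question.
question_order = [1, 0, 2, 4, 3, 5, 6, 7, 8]

# Nine entries against a single label column: rejected.
print(is_valid_order(question_order, 1))
# The only explicit order that fits one label column.
print(is_valid_order([0], 1))
# The same nine-entry list is accepted once Y really has nine label columns.
print(is_valid_order(question_order, 9))
```

So the list from the question is only usable when the target really has nine label columns.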

So, your code should look like this:

...
# load features and target from dataset
X, y = fetch_data(dataset_Name, return_X_y=True)

# Split Train and Test Datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 
y_train = y_train.reshape(-1, 1)   #<--- add this

chains = [ClassifierChain(LogisticRegression(), order="random", random_state=i)
      for i in range(9)]  #<-- use order="random"
for chain in chains:
    chain.fit(X_train, y_train)
Anwarvic
  • I solved it by one-hot encoding the y labels, which fixed the tuple index problem, but now I'm getting another error. Can you check the other error I posted in an answer to this question? – Aizayousaf Jun 13 '20 at 22:16