
I want to perform multilabel classification. I have a dataset in ARFF format, which I load. However, I don't know how to convert the imported data into X and y so that I can apply sklearn's train_test_split.

How can I get X and y?

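import pandas as pd
import scipy.io.arff
from sklearn.model_selection import train_test_split

data, meta = scipy.io.arff.loadarff('../yeast-train.arff')
df = pd.DataFrame(data)

# Get X, y
X, y = ???  # <--- how do I get these?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)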
msoares
  • Where did you download `yeast-train.arff` from? There must be a column inside `data` that is your `y`. All the other columns (excluding the target column `y`) become your `X`. – Vivek Kumar Sep 06 '17 at 10:34
  • I downloaded it from [here](http://sourceforge.net/projects/mulan/files/datasets/yeast.rar) – msoares Sep 06 '17 at 20:27

1 Answer


OK. It's a multilabel dataset in which the features are in the columns Att1, Att2, Att3, ..., Att20 and the targets are in the columns Class1, Class2, ..., Class14.

So you need to use those columns to get X and y. Do it like this:

# Fill in the ... with all the other column names
feature_cols = ['Att1', 'Att2', 'Att3', 'Att4', 'Att5', ..., 'Att20']
target_cols = ['Class1', 'Class2', 'Class3', 'Class4', ..., 'Class14']

X, y = df[feature_cols], df[target_cols]
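
Instead of typing every column name by hand, the two groups can probably be picked out by their name prefix. The following is only a rough sketch under some assumptions: the helper `split_features_targets` and the tiny in-memory `SAMPLE_ARFF` are made up for illustration (they stand in for the real file), and the target columns are assumed to be nominal `{0,1}` attributes, which `loadarff` returns as byte strings that need decoding before use.

```python
import io

import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split

# Made-up miniature ARFF content that mimics the Att*/Class* layout
# described above; replace it with the real file path in practice.
SAMPLE_ARFF = """@relation toy
@attribute Att1 numeric
@attribute Att2 numeric
@attribute Class1 {0,1}
@attribute Class2 {0,1}
@data
0.1,0.2,0,1
0.3,0.4,1,0
0.5,0.6,1,1
0.7,0.8,0,0
0.9,1.0,1,0
"""


def split_features_targets(df, feature_prefix="Att", target_prefix="Class"):
    """Hypothetical helper: split df into X and y by column-name prefix."""
    feature_cols = [c for c in df.columns if c.startswith(feature_prefix)]
    target_cols = [c for c in df.columns if c.startswith(target_prefix)]
    X = df[feature_cols]
    # Assumption: targets are nominal, so loadarff yields bytes such as b'0'.
    # Decode them to text and convert to integers.
    y = df[target_cols].apply(lambda col: col.str.decode("utf-8").astype(int))
    return X, y


# loadarff accepts a file-like object as well as a file name.
data, meta = arff.loadarff(io.StringIO(SAMPLE_ARFF))
df = pd.DataFrame(data)

X, y = split_features_targets(df)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

With the real data, passing `'../yeast-train.arff'` to `loadarff` in place of the `StringIO` object should be the only change needed, provided the columns really follow the `Att`/`Class` naming.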
Vivek Kumar