
I am following the code at http://queirozf.com/entries/scikit-learn-pipeline-examples to build a multilabel OneVsRest classifier for text. I would like to compute the hamming_score, so I need to binarize my test labels as well. I have:

    X_train, X_test, labels_train, labels_test = train_test_split(meetings, labels, test_size=0.4)

Here, labels_train and labels_test are lists of lists, for example:

    [['dog', 'cat'], ['cat'], ['people'], ['nice', 'people']]

Now I need to binarize all my labels, so I am doing this:

    all_labels = np.vstack([labels_train, labels_test])
    mlb = MultiLabelBinarizer().fit(all_labels)

as directed in the link, but that throws:

    ValueError: all the input array dimensions except for the concatenation axis must match exactly

I also tried np.column_stack, as suggested here:

numpy array concatenate: "ValueError: all the input arrays must have same number of dimensions"

but that throws the same error.

How can the dimensions match if I am splitting into train and test sets? I am bound to get different shapes, right? Please help, thank you.

Vivek Kumar
Shiva Kumar
  • When using functions like `vstack` and `column_stack` make sure you know the `shape` of the component arrays - or arrays that will be produced with `np.array(....)`. Don't throw variables together and hope they work. – hpaulj May 14 '18 at 23:27
  • That `dog/cat` list has 4 items, some are 2 long, some 1 long. That does not look good for `stacking`. What do you want to produce? – hpaulj May 14 '18 at 23:29

1 Answer


MultiLabelBinarizer works on a list of lists directly, so you don't need to stack them with NumPy. Concatenate the two lists and pass the result straight to fit:

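    from sklearn.preprocessing import MultiLabelBinarizer

    all_labels = labels_train + labels_test  # plain list concatenation, no NumPy stacking
    mlb = MultiLabelBinarizer().fit(all_labels)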
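
For completeness, here is a minimal end-to-end sketch of the same idea. The inner lists have different lengths, which is why stacking them as NumPy arrays fails, while list concatenation does not care. The toy labels and the `y_pred` values are made up for illustration. I am also assuming that scikit-learn's `hamming_loss` is close enough to the hamming_score mentioned in the question; if you need a different definition, swap in your own metric.

```python
from sklearn.metrics import hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels standing in for the question's labels_train / labels_test
labels_train = [['dog', 'cat'], ['cat'], ['people']]
labels_test = [['nice', 'people'], ['dog']]

# Plain list concatenation works even though the inner lists differ in length
all_labels = labels_train + labels_test
mlb = MultiLabelBinarizer().fit(all_labels)

# Both splits are encoded with the same columns, one per class seen during fit
y_train = mlb.transform(labels_train)
y_test = mlb.transform(labels_test)

print(mlb.classes_)                 # class names in sorted order
print(y_train.shape, y_test.shape)  # row counts differ, column counts match

# Hypothetical predictions, used only to show the metric call
y_pred = mlb.transform([['people'], ['dog']])
print(hamming_loss(y_test, y_pred))
```

The train and test matrices can have different numbers of rows; what matters is that they share the same columns, which fitting on the combined labels guarantees.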
Vivek Kumar