I am building a program that assigns multiple labels/tags to textual descriptions, using OneVsRestClassifier to do the labeling. xTrain, xTest, and yTrain are all 'numpy.ndarray', yet fitting the classifier raises the error shown at the end of this post. This seems strange, considering that I have split the training and test data in the correct manner. Below is my code:

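from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)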

nb_clf = MultinomialNB()
sgd = SGDClassifier()
lr = LogisticRegression()
mn = MultinomialNB()

print("xTrain.shape = " + str(xTrain.shape))
print("xTest.shape = " + str(xTest.shape))
print("yTrain.shape = " + str(yTrain.shape))
print("yTest.shape = " + str(yTest.shape))

print("type(xTrain) = " + str(type(xTrain)))
print("type(xTest) = " + str(type(xTest)))

xTrain = csr_matrix(xTrain).toarray()
xTest = csr_matrix(xTest).toarray()
yTrain = csr_matrix(yTrain).toarray()

print("type(xTrain) = " + str(type(xTrain)))

for classifier in [nb_clf, sgd, lr, mn]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(xTrain.astype("U"), yTrain.astype("U"))
    y_pred = clf.predict(xTest)
    print("\ny_pred:")
    print(y_pred)

x output:

  (1466, 1292)  0.13531037414782607
  (1466, 1238)  0.21029405543816293
  (1466, 988)   0.04688335706505732
  ...
  ...

y output:

[[0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

print statements output:

xTrain.shape = (1173, 13817)
xTest.shape = (294, 13817)
yTrain.shape = (1173, 28)
yTest.shape = (294, 28)
type(xTrain) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTest) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTrain) = <class 'numpy.ndarray'>
type(xTest) = <class 'numpy.ndarray'>
type(yTrain) = <class 'numpy.ndarray'>

error (at the clf.fit line):

ValueError: Multioutput target data is not supported with label binarization

desertnaut
Henry Zhu

1 Answer

Please first clarify the feature dimensions and the sample size in your program. The target (y) should not be one-hot encoded. For example, instead of [0 0 0 1], the label should be [3].

brentertainer
  • Is there any way I can make my target array not one-hot encoded? Right now it is, so I was wondering if there's some sort of prebuilt function that can label-encode it. – Henry Zhu Aug 07 '19 at 00:19
  • numpy.argmax(array) returns the index of the maximum in an array, which in your case is the value you're looking for. – CoMartel Aug 07 '19 at 08:48
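
A minimal sketch of the numpy.argmax suggestion, using a small made-up one-hot array (y_one_hot is hypothetical, not the asker's data). For a 2-D target, argmax needs axis=1 to return one index per row; without an axis it returns a single index into the flattened array. The conversion assumes each row contains exactly one 1. argmax reports only the first maximum, so a row with several 1s or none, as a multi-label target can have, would lose information.

```python
import numpy as np

# Hypothetical one-hot target: every row has exactly one 1.
y_one_hot = np.array([
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])

# Column index of the maximum in each row, i.e. the integer class label.
y_labels = np.argmax(y_one_hot, axis=1)
print(y_labels)  # [3 0 1]

# Sanity check: the conversion is only lossless if each row sums to 1.
rows_ok = y_one_hot.sum(axis=1) == 1
print(rows_ok.all())  # True
```

Running the row-sum check on the real target first would show whether this conversion applies at all.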