I have a text dataset with one column of reviews and another column of labels. I want to build a decision tree model on it. I used a vectorizer, but fitting the model raises ValueError: Number of labels=37500 does not match number of samples=1. vect.vocabulary_ returns {'review': 0}, where review is the column name, so I think the vectorizer is not being fitted to all of the data. Here is the code; any help is appreciated.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(data.iloc[:,:-1],data.iloc[:,-1:],
test_size = 0.25, random_state = 42)

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

from sklearn.tree import DecisionTreeClassifier 
DTC = DecisionTreeClassifier()
DTC.fit(X_train_dtm, y_train)
y1_pred_class = DTC.predict(X_test_dtm)

Also, X_train_dtm.shape shows <bound method spmatrix.get_shape of <1x1 sparse matrix of type '<class 'numpy.int64'>' with 1 stored elements in Compressed Sparse Row format>>, i.e. a 1x1 matrix.

imdatyaa

2 Answers

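CountVectorizer expects a 1-dimensional iterable of strings (one per document), and the error suggests that your X_train is 2D. Iterating over a DataFrame yields its column names rather than its rows, which is why the vocabulary contains only 'review' and the transformed matrix is 1x1. If X_train is a DataFrame, reduce it to a Series; if it's a numpy array, use reshape or ravel.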
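A minimal sketch of the difference, assuming a small made-up DataFrame (the rows and the "label" column name are hypothetical; only "review" comes from the question). It compares what CountVectorizer sees when given a one-column DataFrame versus a Series:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the real dataset.
data = pd.DataFrame({
    "review": ["good movie", "bad movie", "great plot", "bad acting"],
    "label": [1, 0, 1, 0],
})

X_2d = data.iloc[:, :-1]  # one-column DataFrame, shape (4, 1)
X_1d = data["review"]     # Series, shape (4,)

# Iterating over a DataFrame yields column names, so the vectorizer
# sees a single "document": the string "review".
vect_2d = CountVectorizer().fit(X_2d)
vocab_2d = vect_2d.vocabulary_
shape_2d = vect_2d.transform(X_2d).shape

# Iterating over a Series yields the review strings themselves.
vect_1d = CountVectorizer().fit(X_1d)
shape_1d = vect_1d.transform(X_1d).shape

print(vocab_2d, shape_2d, shape_1d)
```

With the Series, the number of rows in the document-term matrix matches the number of reviews, so it lines up with the labels.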

Ben Reiniger

It worked when I passed single columns (Series) to train_test_split instead of iloc slices:

X_train, X_test,y_train, y_test = train_test_split(data['text'], data['tag'],test_size = 0.25, random_state = 42)
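For reference, here is a sketch of the whole pipeline with that change applied. The toy rows are invented for illustration, the column names "text" and "tag" follow the line above, and random_state on the classifier is an added assumption to make the run repeatable:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; replace with the real DataFrame.
data = pd.DataFrame({
    "text": [
        "good movie", "bad movie", "great plot", "bad acting",
        "loved it", "hated it", "wonderful cast", "terrible script",
    ],
    "tag": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Pass Series (1-D), not one-column DataFrames.
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["tag"], test_size=0.25, random_state=42
)

# Fit the vocabulary on the training text only, then reuse it for the test text.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_dtm, y_train)
y_pred = clf.predict(X_test_dtm)

print(X_train_dtm.shape, len(y_pred))
```

Because y_train is now a Series as well, the classifier receives a 1-D label array with the same length as the number of rows in X_train_dtm.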

imdatyaa