I have a training data set of 144 student feedback entries, 72 positive and 72 negative. The data set has two attributes, data and target, which contain the sentence and its sentiment (positive or negative) respectively. The testing data set contains 106 unlabeled feedback entries. Consider the following code:

import pandas as pd
feedback_data = pd.read_csv('output_svm.csv')
print(feedback_data)


data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive
8                           good communication skills.  positive
9                               good teaching methods.  positive
10   posseses very good and thorough knowledge of t...  positive

feedback_data_test = pd.read_csv('classified_feedbacks_test.csv')
print(feedback_data_test)

          data  target
0                                       good teaching.     NaN
1                                         punctuality.     NaN
2                    provides good practical examples.     NaN
3                              weak subject knowledge.     NaN
4                                   excellent teacher.     NaN
5                                         no strength.     NaN
6                      very poor communication skills.     NaN
7                      not able to clear the concepts.     NaN
8                                            punctual.     NaN
9                             lack of proper guidance.     NaN
10                                  fantastic speaker.     NaN

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
ct = CountVectorizer(binary= True)
cv.fit(feedback_data['data'].values)
ct.fit(feedback_data_test['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X = list([list(x.toarray()[0]) for x in X])
X_test = feedback_data_test['data'].apply(lambda X_test : ct.transform([X_test])).values
X_test = list([list(x.toarray()[0]) for x in X_test])




from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)
#The below line gives error
print("Accuracy = %s" %accuracy_score(target,clf.predict([X_test])) )

I do not know what is wrong. Please help.

  • Why would you write `clf.predict([X_test])` and not just `clf.predict(X_test)`? – Frayal Feb 26 '19 at 09:10
  • I tried that too, but it raises the following error: X.shape[1] = 159 should be equal to 287. Does the number of samples in the training data need to be exactly the same as the number of samples in the test data? – Neeraj Sharma Feb 27 '19 at 08:34

1 Answer

The error you get is not about the number of samples but about the number of features, and it comes from these lines of code:

cv = CountVectorizer(binary = True)
ct = CountVectorizer(binary= True)
cv.fit(feedback_data['data'].values)
ct.fit(feedback_data_test['data'].values)

You need to encode the train and test data the same way.

Fit a single CountVectorizer on all the text (train and test together) and then apply it to both sets. Otherwise the two vectorizers learn different vocabularies, so the encodings have different numbers of columns (the 159 versus 287 in your error message).

import numpy as np
cv = CountVectorizer(binary = True)
# Fit one vectorizer on the train and test text so both share a vocabulary
cv.fit(np.concatenate((feedback_data['data'].values, feedback_data_test['data'].values)))
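To see the mismatch on a small scale, here is a minimal sketch with made-up sentences. The `train_texts` and `test_texts` lists are hypothetical stand-ins for your two CSV columns, not your real data. It compares the number of columns you get from two separately fitted vectorizers with what you get from one shared vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences standing in for the 'data' columns of the two CSV files
train_texts = ["good teaching.", "lectures are very lengthy.", "good communication skills."]
test_texts = ["excellent teacher.", "very poor communication skills."]

# Two separately fitted vectorizers each learn their own vocabulary
cv = CountVectorizer(binary=True).fit(train_texts)
ct = CountVectorizer(binary=True).fit(test_texts)
separate_widths = (cv.transform(train_texts).shape[1],
                   ct.transform(test_texts).shape[1])

# One vectorizer fitted on both sets gives every sentence the same columns
shared = CountVectorizer(binary=True).fit(train_texts + test_texts)
shared_widths = (shared.transform(train_texts).shape[1],
                 shared.transform(test_texts).shape[1])

print("separate vectorizers:", separate_widths)  # column counts differ
print("shared vectorizer:   ", shared_widths)    # column counts match
```

The classifier only accepts input with the same number of columns it was trained on, which is why the separately encoded test set is rejected.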

EDIT

Don't use ct at all; use only cv for both the train and the test data:

X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X = list([list(x.toarray()[0]) for x in X])
X_test = feedback_data_test['data'].apply(lambda X_test :cv.transform([X_test])).values
X_test = list([list(x.toarray()[0]) for x in X_test])
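One more thing that will likely bite once the feature counts match: `accuracy_score` needs true labels with the same length as the predictions, and your test file has no labels (and 106 rows against 144 targets). Below is a rough sketch of one way around that, assuming you measure accuracy on the held-out validation split and only predict on the unlabeled test set. The sentences and labels are made up for illustration. I also dropped `gamma`, since the linear kernel does not use it.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the two CSV files
train_texts = [
    "good teaching.", "good communication skills.", "excellent teacher.",
    "good at clearing the concepts.", "lectures are very lengthy.",
    "weak subject knowledge.", "very poor communication skills.",
    "lack of proper guidance.",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative
test_texts = ["fantastic speaker.", "not able to clear the concepts."]  # unlabeled

# One vectorizer for everything, so train and test share the same columns
cv = CountVectorizer(binary=True)
cv.fit(train_texts + test_texts)
X = cv.transform(train_texts)
X_test = cv.transform(test_texts)

# Hold out half of the labeled data to measure accuracy
X_train, X_val, y_train, y_val = train_test_split(
    X, train_labels, train_size=0.5, stratify=train_labels, random_state=0
)

clf = svm.SVC(kernel="linear", C=0.05)
clf.fit(X_train, y_train)  # fit on the training split only

# Accuracy is only defined where true labels exist: the validation split
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print("Validation accuracy = %s" % val_accuracy)

# The unlabeled test set can be predicted but not scored
test_predictions = clf.predict(X_test)  # note: X_test, not [X_test]
print(test_predictions)
```

Note that the model is fitted on `X_train` rather than on all of `X`; otherwise the validation score is computed on data the model has already seen.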
Frayal
  • The above code gives the error: File "", line 4 X = feedback_data['data'].apply(lambda X : cv.transform([X])).values ^ SyntaxError: invalid syntax – Neeraj Sharma Feb 27 '19 at 10:34
  • Well, it's a copy of your code... Could you please use the code I gave you in the other answers, with only one CountVectorizer? I feel like we are taking one step forward and two steps back every time I give you an answer. – Frayal Feb 27 '19 at 11:05