CountVectorizer Error: ValueError: setting an array element with a sequence

Question

I have a data set of 144 student feedback entries, 72 positive and 72 negative. The data set has two attributes, data and target, which contain the sentence and its sentiment (positive or negative) respectively. Consider the following code:

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)  


    data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive
8                           good communication skills.  positive
9                               good teaching methods.  positive
10   posseses very good and thorough knowledge of t...  positive
11   posseses superb ability to provide a lots of i...  positive
12   good conceptual skills and knowledge for subject.  positive
13                      no commuication outside class.  negative
14                                     rude behaviour.  negative
15            very negetive attitude towards students.  negative
16   good communication skills, lacks time punctual...  positive
17   explains in a better way by giving practical e...  positive
18                               hardly comes on time.  negative
19                          good communication skills.  positive
20   to make students comfortable with the subject,...  negative
21                       associated to original world.  positive
22                             lacks time punctuality.  negative

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(144)]
print(target)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
#The below line gives the error
clf.fit(X , target)

I do not know what is wrong. Please help

Frayal · Accepted Answer · 2019-02-25T16:52:54.127

0

The error comes from the way X has been built. You cannot pass X directly to the fit method; you first need to transform it a little more. (I could not have told you that for the other problem, as I did not have the information.)

Right now you have the following:

array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
   ...
   <1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)

Which is enough to do a split. We are just going to transform it into something you can understand, and so can the fit method:

X = list([list(x.toarray()[0]) for x in X])

What we do is convert each sparse matrix to a numpy array, take its first row (it has only one row) and then convert it to a list to make sure it has the right dimension.

Now, why are we doing this?

Each element of X looks like this:

>>>X[0]
   <1x23 sparse matrix of type '<class 'numpy.int64'>'
   with 5 stored elements in Compressed Sparse Row format>

So we convert it to see what it really is:

>>>X[0].toarray()
   array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
         0]], dtype=int64)

As you can see, there is a slight issue with the dimension: the array has shape (1, 23), so we take the first element to get a flat row of length 23.

Going back to a list changes nothing for fit; it is only there so that you can clearly see what you have. (You can drop it for speed.)
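
To make the shapes concrete, here is a minimal sketch on a toy stand-in for the data column (three sentences borrowed from the question; `docs`, `rows` and `X_dense` are names made up for this illustration). It builds one sparse matrix per sentence with a list comprehension, which mirrors what the apply(...) call produces, and then stacks the densified rows into a single 2-D array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for feedback_data['data'] (assumption: three sample sentences).
docs = [
    "good at teaching.",
    "lectures are very lengthy.",
    "rude behaviour.",
]

cv = CountVectorizer(binary=True)
cv.fit(docs)

# One 1 x n_features sparse matrix per sentence, as the apply(...) call yields.
rows = [cv.transform([s]) for s in docs]
print(rows[0].shape)      # a single row: (1, n_features)

# Densify each row and drop the leading length-1 axis, then stack the rows.
X_dense = np.array([r.toarray()[0] for r in rows])
print(X_dense.shape)      # (n_documents, n_features)
```

The point is only that every row ends up with the same length, so the whole thing is a regular 2-D table that fit can consume.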

Your code is now this:

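from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
# one 1 x n_features sparse matrix per sentence
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
# densify each row so that X becomes a list of equal-length rows
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)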
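
As a possible shortcut (not what the code above does), the per-row apply can be skipped: calling fit_transform on the whole collection returns one 2-D sparse matrix, and SVC accepts sparse input. The sketch below assumes a toy four-sentence corpus and made-up labels in place of the real 144 rows; gamma is left out because the linear kernel does not use it.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real data set (assumption: sentences from the question).
docs = [
    "good at teaching.",
    "good communication skills.",
    "lectures are very lengthy.",
    "rude behaviour.",
]
target = [1, 1, 0, 0]  # 1 = positive, 0 = negative

cv = CountVectorizer(binary=True)
# A single 2-D sparse matrix of shape (n_documents, n_features).
X = cv.fit_transform(docs)

clf = svm.SVC(kernel="linear", C=0.05)
clf.fit(X, target)

# New text must go through the same fitted vectorizer before predict.
pred = clf.predict(cv.transform(["good teaching methods."]))
print(X.shape, pred.shape)
```

With the real data this would be cv.fit_transform(feedback_data['data']), assuming the column holds plain strings.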

edited Feb 25 '19 at 16:52

answered Feb 25 '19 at 16:24


so should I convert the complete X to array and then apply the fit() method ? – Neeraj Sharma Feb 25 '19 at 16:30
Hi I did this for x in X: X = x.toarray() clf.fit(X, target) gives the error: ValueError: Found input variables with inconsistent numbers of samples: [1, 144] – Neeraj Sharma Feb 25 '19 at 16:36
'numpy.ndarray' object has no attribute 'toarray' – Neeraj Sharma Feb 25 '19 at 16:40
Hi I did this for x in X: X = x.toarray() clf.fit(X, target) gives the error: ValueError: Found input variables with inconsistent numbers of samples: [1, 144] and trying X = [list(x.toarray()[0]) for x in X] gives the error 'numpy.ndarray' object has no attribute 'toarray' – Neeraj Sharma Feb 25 '19 at 16:44
Did you copy and paste it exactly? I don't convert it to an array until the last second; x is a sparse matrix at this point. Re-run your code from the start with only the piece of code I gave you. You have tried something and it has changed X. (Delete the previous comments to keep things readable.) – Frayal Feb 25 '19 at 17:07
OK, that problem is solved, thanks for the support you have provided. But I still need your help, because the test data gives me an error when I try to print the accuracy. I have posted a separate question for it. – Neeraj Sharma Feb 25 '19 at 17:20
I'll look into it when I have time (I'll find it, don't worry). – Frayal Feb 25 '19 at 17:25
Hey, good morning. I tried that script. My test data set consists of 106 unlabeled feedback entries. Similar to X, X_test is also a list: X_test = feedback_data_test['data'].apply(lambda X_test : ct.transform([X_test])).values; X_test = list([list(x.toarray()[0]) for x in X_test]). When I try to determine the accuracy with print("Accuracy = %s" %accuracy_score(target,clf.predict([X_test])) ), it gives the error: ValueError: Found array with dim 3. Estimator expected <= 2. Please help. – Neeraj Sharma Feb 26 '19 at 06:44
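
A likely cause of the "dim 3" message in the last comment is the extra pair of brackets in clf.predict([X_test]): X_test is already a 2-D list of rows, so wrapping it in another list adds a third axis. The sketch below reproduces that on toy data (the sentences, labels and variable names are assumptions for illustration, not the asker's files). Separately, accuracy_score needs the true labels of the same samples that were predicted, so it cannot be computed against unlabeled test data.

```python
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Toy training and test sentences (assumption: borrowed from the question).
train_docs = ["good at teaching.", "good communication skills.",
              "lectures are very lengthy.", "rude behaviour."]
target = [1, 1, 0, 0]
test_docs = ["good teaching methods.", "hardly comes on time."]

cv = CountVectorizer(binary=True)
X = cv.fit_transform(train_docs).toarray()
X_test = cv.transform(test_docs).toarray()   # already 2-D

clf = svm.SVC(kernel="linear", C=0.05).fit(X, target)

# Wrapping the 2-D data in another list creates a 3-D array.
print(np.asarray([X_test]).ndim)

# Passing X_test as it is gives one prediction per test sentence.
pred = clf.predict(X_test)
print(pred.shape)
```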
