I am trying to classify text with an SVM. I want to assign each test document one of two labels (health/adult). The training and test data are text files.

I am using Python's scikit-learn library. When I saved the text to .txt files I encoded it as UTF-8, which is why I decode it in the snippet. Here is my attempted code:

String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                     token_pattern=r'\b\w+\b', min_df=1)

X_2 = bigram_vectorizer.fit_transform(String2).toarray()
X_1 = bigram_vectorizer.fit_transform(String).toarray()
X_train = np.array([X_1,X_2])
print type(X_train)
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)

#prepare test data
print(clf.predict(X))

This is the error I am getting

  File "/Users/guru/python_projects/implement_LDA/lda/apply.py", line 107, in <module>
    clf.fit(X_train, y)
  File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/svm/base.py", line 150, in fit
    X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
  File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

When I searched for the error I found some results, but they didn't help. I think I am applying the SVM model incorrectly. Can someone give me a hint?

Ref: [1][2]

prashantitis

1 Answer

You have to combine your samples into one collection, vectorize them together with a single vectorizer so that they share one vocabulary, and then fit the classifier. Like this:

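import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                     token_pattern=r'\b\w+\b', min_df=1)

# Fit one vectorizer on both texts so they share a single vocabulary
X_train = bigram_vectorizer.fit_transform(np.array([String, String2]))
print(type(X_train))
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)

# Prepare the test data with transform (not fit_transform) to reuse that vocabulary
print(clf.predict(bigram_vectorizer.transform(np.array([X1, X2, ...]))))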

But two samples is very little data, so your predictions are unlikely to be accurate.
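To illustrate why the original code fails, here is a minimal sketch. It assumes two made-up texts (`health_text` and `adult_text` are hypothetical stand-ins for the real files) and a small helper, `make_vectorizer`, that is not part of the question. Fitting a vectorizer separately on each text learns a separate vocabulary each time, so the resulting arrays have different numbers of columns and cannot be stacked into one training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the two training texts.
health_text = "doctors recommend regular exercise and a balanced diet"
adult_text = "explicit content for adults only"


def make_vectorizer():
    # Same settings as in the question.
    return CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b", min_df=1)


# Fitting separately: each call learns its own vocabulary,
# so the column counts generally differ.
X_1 = make_vectorizer().fit_transform([health_text]).toarray()
X_2 = make_vectorizer().fit_transform([adult_text]).toarray()
separate_shapes = (X_1.shape, X_2.shape)

# Fitting once on both texts: one shared vocabulary, one row per text.
X_train = make_vectorizer().fit_transform([health_text, adult_text])
joint_shape = X_train.shape

print(separate_shapes, joint_shape)
```

With differently sized rows, `np.array([X_1, X_2])` cannot form a regular 2-D float array, which is consistent with the "setting an array element with a sequence" error in the traceback.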

EDITED:

You can also combine the transformation and the classification into one step using a Pipeline.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

print(type(X_train))  # X_train should be a list of raw texts, length 100 in your case
y_train = ...  # Should also be a list of length 100 (one label per text)
clf = Pipeline([
    ('transformer', CountVectorizer(...)),
    ('estimator', SVC()),
])
clf.fit(X_train, y_train)

X_test = np.array(["sometext"])  # array of test texts, length = 1
print(clf.predict(X_test))
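As a rough end-to-end sketch of the pipeline idea, the snippet below uses a handful of invented sentences in place of the real website texts. The lists `train_texts`, `train_labels` and `test_texts` are hypothetical, and the linear kernel is my own assumption (the code above uses the default `SVC()`). Each list element stands for the text of one site.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical training data: one string per website, one label per string.
train_texts = [
    "the clinic offers treatment and advice from a nurse",
    "doctor and nurse discuss patient treatment at the hospital",
    "healthy diet and exercise advice from the doctor",
    "adult videos and explicit content for viewers over eighteen",
    "explicit adult content restricted to viewers over eighteen",
    "adult entertainment with explicit videos",
]
train_labels = ["health", "health", "health", "adult", "adult", "adult"]

clf = Pipeline([
    # Same vectorizer settings as in the question.
    ("transformer", CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")),
    # Linear kernel is an assumption made for this sketch, not part of the answer.
    ("estimator", SVC(kernel="linear")),
])
clf.fit(train_texts, train_labels)

# Test data is again a list of raw strings, even when there is only one page.
test_texts = ["the nurse at the clinic gave treatment advice", "explicit adult videos"]
predictions = clf.predict(test_texts)
print(predictions)
```

The pipeline takes raw strings for both `fit` and `predict`, so the vocabulary learned during fitting is reused automatically at prediction time.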
frist
  • Okay, thanks very much. But I have two doubts. If I pass only one test string in the last line of your code (in clf.predict()), like this: clf.predict(bigram_vectorizer.transform(corpus)), then I get [1,1,1....1,1,1] as output. Why? – prashantitis Jul 27 '16 at 12:43
  • This is because the vectorizer accepts a sequence, and a string is also a sequence. You have to wrap your input in an array like this: `...transform(np.array([corpus]))` – frist Jul 27 '16 at 12:56
  • I am getting quite wrong predictions. Apart from collecting more data, can you suggest some techniques I can apply within the same SVM model to improve them? Would a list of custom stopwords help? – prashantitis Jul 27 '16 at 13:24
  • Moreover, were you talking about too little data in the training set or in the test set? – prashantitis Jul 27 '16 at 13:31
  • @Guru First of all, a model fitted on two samples is not enough to make any prediction. You need more data for training. Prediction doesn't train your model; only fitting does. Perhaps if you explain your task I could give you some suggestions. – frist Jul 27 '16 at 13:35
  • I am trying to identify the category of a website based on the text of its webpages. For that, train_health.txt contains text from healthcare websites (around 50), while train_adult.txt contains text from adult websites (around 50). test_data.txt contains text from only one website, whose category needs to be predicted. Hope you got my task? – prashantitis Jul 27 '16 at 13:49
  • @frist I hope I was able to explain the whole task. If anything is still unclear on my side, please let me know. – prashantitis Jul 27 '16 at 18:41
  • @Guru Well, you have to create a training set from the content of your files (train_health.txt and train_adult.txt). That means you need an input array (a list of 100 elements in your case) in which each element represents one site. This will be your X_train, and the category of each element will be your y_train. Please take a look at the answer; I've edited it. If you want to continue the discussion, please provide your datasets. – frist Jul 28 '16 at 05:16
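The point in the comments about a string being a sequence can be sketched as follows. The texts are made up for illustration (`train_texts` and `page_text` are hypothetical names). As far as I know, older scikit-learn releases iterate over a bare string character by character, while recent ones raise a `ValueError`, so the sketch handles both cases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training texts and test page, for illustration only.
train_texts = ["healthy diet and exercise", "explicit adult content"]
page_text = "diet and exercise tips"

vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
vectorizer.fit(train_texts)

# Iterating over a bare string yields characters, not documents.
n_items_in_string = len(list(page_text))

# Passing the bare string: depending on the scikit-learn version this either
# treats each character as its own document or raises a ValueError.
try:
    rows_for_bare_string = vectorizer.transform(page_text).shape[0]
except ValueError:
    rows_for_bare_string = None

# Wrapping the string in a list gives exactly one document, hence one row.
rows_for_wrapped = vectorizer.transform([page_text]).shape[0]

print(n_items_in_string, rows_for_bare_string, rows_for_wrapped)
```

One row per character would explain getting one prediction per character, which matches the long `[1,1,1....1,1,1]` output described above.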