
I want to classify every incoming message. I work with Persian texts. I already implemented a text classifier with Naive Bayes. I did not use tf-idf because every single feature is important to me, but I did use some tricks to delete stop words and punctuation to get better accuracy.

I want to implement a text classifier with SVM, but I searched a lot and all I found is related to using the Pipeline function with tf-idf, like below:

model = Pipeline([('vectorizer', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

Now, how can I use SVM without tf-idf?

Thanks.

Aaron_ab
hadi javanmard
  • Could you provide more detail about the model you are trying to build? What are your features? Words? Are you using a bag of words of the message as your data? – thebeancounter Jan 01 '19 at 15:05
  • My model includes the body (text) of the messages and a label. I have 6 labels. Yes, my data is made of words that form sentences. @thebeancounter – hadi javanmard Jan 01 '19 at 17:36

1 Answer


See here for the sklearn page about SVM; it has a section on multiclass classification with SVM. You first have to convert your texts into a numeric feature vector if you wish to use SVM. If you would like to use bag of words, you could use this SO question and this manual page of sklearn.

You can use pre-written Python code to create a BOW from your texts by doing something like the following. Mind you, I gathered the relevant information for the OP (the question was unclear and not compatible with SO standards), so you might need to adapt the code a bit to fit your exact usage.

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> vectorizer = CountVectorizer()
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

Then you might need to convert X into a dense matrix (depending on the sklearn version). Then you can feed X into an SVM model, which you can create like so:
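That said, in recent sklearn versions LinearSVC accepts the sparse matrix from CountVectorizer directly, so the dense conversion is often unnecessary. A minimal sketch; the toy corpus and labels here are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and binary labels, invented for illustration only
corpus = ["good message", "bad message", "good news", "bad news"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse count matrix, no tf-idf

clf = LinearSVC()
clf.fit(X, labels)                    # sparse input works directly
print(clf.predict(vectorizer.transform(["good message"])))
```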

>>> from sklearn import svm
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
>>> clf.fit(X, Y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
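Putting it together for the OP's case: the pipeline from the question can simply drop the TfidfTransformer step, so raw token counts go straight into the SVM. A sketch with made-up placeholder messages and labels; the real (pre-cleaned) Persian messages and the 6 real labels would go in their place:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Placeholder data, invented for illustration only
messages = ["buy now cheap", "meeting at noon",
            "cheap offer now", "noon meeting moved"]
labels = ["spam", "work", "spam", "work"]

# Same pipeline as in the question, minus the tf-idf step
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced"))),
])
model.fit(messages, labels)
print(model.predict(["cheap offer"]))
```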
thebeancounter