
I am working on a multiclass text classification problem with many different classes (15+). I have trained a LinearSVC SVM model (the method is just an example), but it outputs only the single class with the highest score. Is there a way to have the algorithm output two classes at the same time?

Sample code I am using:

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# bag-of-words counts over unigrams to trigrams
count_vect = CountVectorizer(max_df=0.9, min_df=0.002,
                             encoding='latin-1',
                             ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(df_upsampled['text'])

# TF-IDF weighting of the counts
tfidf_transformer = TfidfTransformer(sublinear_tf=True, norm='l2')
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = LinearSVC().fit(X_train_tfidf, df_upsampled['reason'])
y_pred = clf.predict(X_test)  # X_test: test texts transformed with the same vectorizers

Current output:

   source    user  time  text            reason
0  hi        neha  0     0:neha:hi       1
1  there     ram   1     1:ram:there     1
2  ball      neha  2     2:neha:ball     3
3  item      neha  3     3:neha:item     6
4  go there  ram   4     4:ram:go there  7
5  kk        ram   5     5:ram:kk        1
6  hshs      neha  6     6:neha:hshs     2
7  ggsgs     neha  7     7:neha:ggsgs    15

Desired output:

   source    user  time  text            reason  reason2
0  hi        neha  0     0:neha:hi       1       2
1  there     ram   1     1:ram:there     1       6
2  ball      neha  2     2:neha:ball     3       7
3  item      neha  3     3:neha:item     6       4
4  go there  ram   4     4:ram:go there  7       9
5  kk        ram   5     5:ram:kk        1       2
6  hshs      neha  6     6:neha:hshs     2       3
7  ggsgs     neha  7     7:neha:ggsgs    15      1

It is okay if I get the output in just one column, as I can split it and make two columns from it.

2 Answers


LinearSVC does not provide predict_proba, but it does provide decision_function, which gives the signed distance of a sample from each class's hyperplane.

From the documentation:

decision_function(self, X):

Predict confidence scores for samples.

The confidence score for a sample is the signed distance of that sample to the hyperplane.

Based on @warped's comments, we can use the decision_function output to find the top n predicted classes from the model.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# dummy multiclass dataset
X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
clf.fit(X_train, y_train)

# sort each row of decision scores and keep the indices of the
# top n classes, highest score first
top_n_classes = 2
predictions = clf.decision_function(
                    X_test).argsort()[:, -top_n_classes:][:, ::-1]

pred_df = pd.DataFrame(predictions,
                       columns=[f'{i+1}_pred' for i in range(top_n_classes)])

df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)

df
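
To produce the desired reason / reason2 columns with the question's own TF-IDF pipeline, the same trick can be applied to the fitted LinearSVC directly. A minimal sketch, where df_test is a hypothetical dataframe holding the texts to score and count_vect, tfidf_transformer and clf are the objects from the question:

import numpy as np

# transform the new texts with the vectorizers fitted on the training data
X_test_counts = count_vect.transform(df_test['text'])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

scores = clf.decision_function(X_test_tfidf)             # shape (n_samples, n_classes)
top2_idx = np.argsort(scores, axis=1)[:, -2:][:, ::-1]   # column indices of the 2 best classes
top2_labels = clf.classes_[top2_idx]                     # map indices back to the 'reason' labels

df_test['reason'] = top2_labels[:, 0]    # same as clf.predict(X_test_tfidf)
df_test['reason2'] = top2_labels[:, 1]   # second-best class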

Venkatachalam
  • Thanks for your answer. I was looking for the second class with the highest probability, but I will figure that out by sorting and taking the position index. If you have a solution ready that spits out two reasons, kindly help me with that. – Puneet Sinha May 26 '20 at 11:14
  • Just set `top_n_classes = 2` and you will get the top two reasons. – Venkatachalam May 26 '20 at 11:19
  • No Venkat, it is giving random results. I have checked it; the first column itself does not match clf.predict(). Could you please try it once at your end? – Puneet Sinha May 26 '20 at 12:33
  • In the example that I have posted it is working fine. Probably your data does not have enough information for the model to predict the right values. What is the test accuracy of your model? – Venkatachalam May 26 '20 at 12:36
  • I have done cross-validation with LinearSVC and it's giving 91% accuracy, and after checking manually too, the trained model is doing a good job. – Puneet Sinha May 26 '20 at 12:37
  • I have no other suggestions; please verify that you have implemented my solution correctly on your data. – Venkatachalam May 26 '20 at 16:44
  • Taking the absolute value of the decision function is incorrect. See the edit to my post. – warped May 29 '20 at 06:43
  • Yes, you are right. Thanks for correcting my answer. – Venkatachalam May 29 '20 at 09:27
  • @PuneetSinha I have corrected my answer based on @warped's inputs. Can you try it out now? – Venkatachalam May 30 '20 at 02:57

LinearSVC has a method called decision_function, which gives confidence scores for the individual classes:

The confidence score for a sample is the signed distance of that sample to the hyperplane.

Example with a 3-class dataset:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
import numpy as np

# dummy 3-class dataset
X, y = make_classification(n_classes=3, n_clusters_per_class=1)

# train the classifier and get the decision scores
clf = LinearSVC().fit(X, y)
decision = clf.decision_function(X)
decision = np.round(decision, 2)

prediction = clf.predict(X)

# looking at the decision scores and the predicted class:

for a, b in zip(decision, prediction):
    print(a, b)

[...]
[ 3.04 -0.61 -7.1 ] 0
[-4.99  1.85 -1.62] 1
[ 3.01 -0.98 -5.93] 0
[-2.61 -1.12  2.64] 2
[-3.43 -0.65  1.32] 2
[-1.78 -1.67  4.15] 2
[...]

You can see that the classifier takes the class with the maximum score as its prediction.
To get the best two classes, you would take the two highest scores, as in the sketch below.
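
Continuing the 3-class snippet above, a minimal sketch of picking the two highest scores per row with numpy; the first column should coincide with clf.predict:

# indices of the two highest decision scores per sample, best class first
scores = clf.decision_function(X)
top2 = np.argsort(scores, axis=1)[:, -2:][:, ::-1]

print((top2[:, 0] == prediction).all())  # expected: True, matches clf.predict(X)
print(top2[:5])                          # best and second-best class per sample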

Edit:

Note what signed distance means:

sign of the decision function:

+: yes (data point belongs to class)

-: no (data point does not belong to class)

absolute value of the decision function:

denotes confidence in the decision.

Example from the first row in the code above:

[ 3.04 -0.61 -7.1 ] 0

Decision for class 0: 3.04 => the classifier thinks that the data belongs to class 0, with a certainty score of 3.04.

Decision for class 1: -0.61 => the classifier thinks that the data does not belong to class 1, with a certainty score of 0.61.

Decision for class 2: -7.1 => the classifier thinks that the data does not belong to class 2, with a certainty score of 7.1.
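
As a quick illustration of why the sign, not the absolute value, should drive the ranking (the point raised in the comments on the other answer), a small sketch using the scores from the first row above:

row = np.array([3.04, -0.61, -7.1])   # decision scores of the first row

# ranking by the raw signed scores puts the predicted class first
print(np.argsort(row)[::-1])          # [0 1 2]

# ranking by absolute values would wrongly rank the most confident "no" on top
print(np.argsort(np.abs(row))[::-1])  # [2 0 1]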

warped