I'm predicting a value from my training dataset and calculating the class probabilities. When I sum them, I always get 1 (100%). This is my training data:

Address                                                        Location_ID
Arham Brindavan,plot no.9,3rd road Near ls Stn,cannop          4485
Revanta,Behind nirmal puoto Mall, G-M link Road, Mulund(W)     10027
Sandhu Arambh,Opp St.Mary's Convent, rose rd, Mulund(W)        10027
Naman Premirer, Military Road, Marol Andheri E                 5041
Dattatreya Ayuedust Adobe Hanspal, bhubaneshwar                6479

This is my test data:

Address                                                          Location_ID
Tata Vivati , Mhada Colony, Mulund (E), Mumbai                     10027
Evershine Madhuvan,Sen Nagar, Near blue Energy,Santacruz(E)        4943

This is what I have tried:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

data=pd.read_csv('D:/All files/abc.csv')
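msk = np.random.rand(len(data)) < 0.8     # random mask: roughly 80% train, 20% test
data_train = data[msk]                    # training rows
data_train_add = data_train.iloc[:, 0]    # training addresses (first column)
data_train_loc = data_train.iloc[:, 1]    # training Location_ID labels (second column)

data_test1 = data[~msk]                   # testing rows
data_test = data_test1.iloc[:, 0]         # testing addresses (first column)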

data_train_add = np.array(data_train_add)
data_train_loc = np.array(data_train_loc)

count_vect = CountVectorizer(ngram_range=(1,3))
X_train_counts = count_vect.fit_transform(data_train_add.ravel())

tfidf_transformer = TfidfTransformer()
data_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Logistic regression fitted with SGD (log loss), 5 passes over the training data
clf_svm = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42).fit(data_train_tfidf, data_train_loc.ravel())

X_new_counts = count_vect.transform(data_test.ravel())
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted_svm = clf_svm.predict(X_new_tfidf)

clf_svm_prob=clf_svm.predict_proba(X_new_tfidf) 
prob_sum=clf_svm_prob.sum(axis=1)
print(prob_sum)
Output:
 array([ 1.,  1.,  1.,  1.])

Why does the sum always come out as 1 (100%)? Which parameter should I change to get the correct sum of probabilities? Please suggest. Thanks in advance.

Vivek Kumar
Andre_k
  • It is summing the probabilities of all classes for that sample, so it is obviously going to be 1. What do you expect? Can you explain a bit more about what you want to achieve? Do you want to sum the probabilities of a single class over all test samples? – Vivek Kumar May 31 '17 at 09:38
  • @VivekKumar Yes, I'm expecting the sum of the probabilities of each word in the test record. For instance, if for the test record "Tata Vivati , Mhada Colony, Mulund (E), Mumbai" the word probabilities are 0.00023, 0.07693, 0.28811, 0.198827, 0.123121, 0.05920, then it should add only these probabilities (summing the above values gives approx. 0.746, or about 75%). – Andre_k May 31 '17 at 10:24
  • `clf_svm` is a classification estimator. It will not output word probabilities, only class probabilities. I am not able to understand what you mean by word probability. – Vivek Kumar May 31 '17 at 11:49

1 Answer

This works as expected, because the model you are training is discriminative, not generative. The probabilities you are obtaining are

[P(label_1 | x), P(label_2 | x), ..., P(label_K | x)]

and for any such probability distribution (over a finite set of possible values label_1 to label_K)

SUM_i P(label_i | x) = 1
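
As a minimal sketch of why each row sums to 1: for multiclass problems, the scikit-learn documentation describes `SGDClassifier.predict_proba` as normalizing the per-class one-vs-rest estimates. The pure-Python code below only mimics that normalization step. The scores are made-up numbers, and `normalized_class_probabilities` is a hypothetical helper, not a library function.

```python
import math


def sigmoid(score):
    """Map a raw decision score to a binary (one-vs-rest) probability."""
    return 1.0 / (1.0 + math.exp(-score))


def normalized_class_probabilities(scores):
    """Hypothetical helper: turn per-class scores into P(label_i | x).

    Each score is squashed independently, then the results are divided
    by their total, so the returned values always sum to 1.
    """
    raw = [sigmoid(s) for s in scores]
    total = sum(raw)
    return [p / total for p in raw]


# Made-up decision scores for one test address over K = 4 location IDs.
scores = [-2.0, 0.5, 1.5, -0.3]
probs = normalized_class_probabilities(scores)

print(probs)
print(sum(probs))  # 1.0 up to floating-point rounding
```

Whatever the scores are, the division by `total` forces the sum to 1, so no classifier parameter changes that. If the confidence in the predicted class is what matters, the largest entry of each row is the relevant number, not the row sum.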

Discriminative models do not model P(x); there is literally nothing in them that can represent this quantity. Why? Because this makes learning much easier, and if you only care about the label/value, you never need P(x).

What you are after are the opposite quantities, P(x | label_i), since then

P(x) = SUM_i P(x | label_i) P(label_i)

but P(x | label_i) is nowhere to be found in discriminative models either. So, if you need access to P(x), you need to learn it explicitly, for example using GMMs, Naive Bayes, etc., but not logistic regression, which you are using now (and which is a discriminative model).
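
To make the last point concrete, here is a minimal pure-Python sketch of a generative model: a tiny multinomial Naive Bayes with add-one smoothing. The toy tokenized addresses, the helper names (`fit`, `joint_probabilities`), and the choice to skip words not seen in training are all illustrative assumptions, not the asker's data or any library API. It computes P(x | label_i) P(label_i) for each label, sums them to get P(x), and shows that only the normalized posterior sums to 1.

```python
from collections import Counter, defaultdict

# Toy training data (illustrative only): tokenized addresses and labels.
train = [
    (["link", "road", "mulund"], 10027),
    (["rose", "road", "mulund"], 10027),
    (["military", "road", "andheri"], 5041),
]


def fit(samples):
    """Estimate P(label) and add-one-smoothed P(word | label)."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for words, label in samples:
        word_counts[label].update(words)
    vocab = {w for words, _ in samples for w in words}
    priors = {c: n / len(samples) for c, n in label_counts.items()}
    likelihoods = {}
    for c in label_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        likelihoods[c] = {w: (word_counts[c][w] + 1) / total for w in vocab}
    return priors, likelihoods


def joint_probabilities(words, priors, likelihoods):
    """Return {label: P(x | label) * P(label)}; unseen words are skipped."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for w in words:
            if w in likelihoods[c]:
                p *= likelihoods[c][w]
        joint[c] = p
    return joint


priors, likelihoods = fit(train)
joint = joint_probabilities(["colony", "mulund"], priors, likelihoods)

# P(x) = SUM_i P(x | label_i) P(label_i)
p_x = sum(joint.values())
# P(label_i | x), obtained by dividing each joint term by P(x)
posterior = {c: p / p_x for c, p in joint.items()}

print("P(x) =", p_x)                               # well below 1
print("posterior sum =", sum(posterior.values()))  # 1.0 up to rounding
```

In scikit-learn, `MultinomialNB` is the corresponding estimator; its `predict_proba` still returns the normalized posterior, so any P(x)-like quantity has to be taken from the unnormalized joint terms, as in the sketch.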

lejlot