
I have to classify articles into my custom categories, so I chose MultinomialNB from scikit-learn. I am doing supervised learning: an editor looks at the articles daily and tags them, and once they are tagged I include them in my learning model, and so on. Below is the code to give you an idea of what I am doing and using. (I am not including any import lines because I am just trying to sketch the approach.) (Reference)

corpus = train_set
vectorizer = HashingVectorizer(stop_words='english', non_negative=True) 
x = vectorizer.transform(corpus)
x_array = x.toarray()
data_array = np.array(x_array)

cat_set = list(cat_set)
cat_array = np.array(cat_set)
filename = '/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl'

if not os.path.exists(filename):
    # classifier and classes are initialised earlier (omitted here)
    classifier.partial_fit(data_array, cat_array, classes)
    print("Saving Classifier")
    joblib.dump(classifier, filename, compress=9)
else:
    print("Loading Classifier")
    classifier = joblib.load(filename)
    classifier.partial_fit(data_array, cat_array)
    print("Saving Classifier")
    joblib.dump(classifier, filename, compress=9)

Now, after the custom tagging, I have a classifier ready, and it works like a charm on new articles. A new requirement has arisen: to get the most frequent words for each category. In short, I have to extract features from the learned model. Looking into the documentation, I only found out how to extract text features at training time.

But once the model is trained and I only have the model file (.pkl), is it possible to load that classifier and extract features from it?

Will it be possible to get the most frequent terms against each class or category?

planet260

2 Answers


You can access the features through the feature_count_ attribute, which tells you how many times each feature occurred per class. For example:

# Imports
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Data
X   = np.random.randint(3, size=(3, 10))
X2  = np.random.randint(3, size=(3, 10))
y   = np.array([1, 2, 3])

# Initial fit
clf = MultinomialNB()
clf.fit(X, y)

# Check to see that the stored features are equal to the input features
print(np.all(clf.feature_count_ == X))

# Modify fit with new data
clf.partial_fit(X2, y)

# Check to see that the stored features represents both sets of input
print(np.all(clf.feature_count_ == (X + X2)))

In the above example, we can see that the feature_count_ attribute is nothing more than a running sum of the feature counts for each class. Using this, you can go backwards from your classifier model to your features and determine their frequencies. Unfortunately, your problem is more complex: you now need to go back one more step, because your features are not simply words.

This is where the bad news comes in - you used a HashingVectorizer as your feature extractor. If you refer to the docs:

there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

So even though we know the frequency of the features, we can't translate those features back to words. Had you used a different type of feature extractor (perhaps the one referenced on that same page, CountVectorizer) the situation would be different entirely.

In short - You can extract the features from the model and determine their frequency by class, but you can't convert those features back to words.

To obtain the functionality you want, you would need to start over with a reversible mapping (a feature extractor that lets you encode words into features and decode features back into words).

James Mnatzaganian

I would suggest using the code below. You just need to load the pickled object and transform the test data with the same vectorizer. Try a plain TF-IDF vectorizer if you run into problems.

clf = joblib.load('/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl')
# data_test should be the test sample: a list of raw documents
X_test = vectorizer.transform(data_test)
print("pickled model loaded")
print(clf)
pred = clf.predict(X_test)
print("prediction done")

for i, p in enumerate(pred):
    print(i, p)
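For this to work, the vectorizer fitted at training time has to be persisted alongside the classifier, since a freshly constructed vectorizer would not share the same vocabulary. A hedged sketch of saving and reloading both with joblib (the file names and toy data are invented for illustration):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for illustration
train_docs = ["good article about sports and games",
              "market news and stocks"]
train_labels = ["sports", "finance"]

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Persist BOTH objects; the classifier alone cannot transform new text
model_dir = tempfile.mkdtemp()
joblib.dump(clf, os.path.join(model_dir, "classifier.pkl"))
joblib.dump(vectorizer, os.path.join(model_dir, "vectorizer.pkl"))

# Later, in the prediction script, load both and reuse the fitted vectorizer
clf = joblib.load(os.path.join(model_dir, "classifier.pkl"))
vectorizer = joblib.load(os.path.join(model_dir, "vectorizer.pkl"))
pred = clf.predict(vectorizer.transform(["latest stock market report"]))
print(pred)
```

Keeping the two pickle files next to each other avoids the most common failure mode here: predicting with a vectorizer whose vocabulary differs from the one the model was trained on.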
nit254