
I am trying to create a classifier to categorize websites. I am doing this for the very first time, so it's all quite new to me. Currently I am applying a bag-of-words model to a few parts of the web page (e.g. title, text, headings). It looks like this:

from sklearn.feature_extraction.text import CountVectorizer
countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")

X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])

from sklearn.feature_extraction.text import TfidfTransformer

transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)

X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)

from sklearn.naive_bayes import MultinomialNB
text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])

X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])

X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)

accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])

This works fine, and I get the accuracies as expected. However, as you might have guessed, it looks cluttered and is filled with duplication. My question, then: is there a way to write this more concisely?

Additionally, I am not sure how I can combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that's not allowed.

Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors have a higher importance than others.

I'm guessing I need a Pipeline, but I'm not at all sure how I should use it in this case.

– Bram Vanroy

2 Answers


The snippet below is a possible way to simplify your code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer(encoding="cp1252", stop_words="english")
tt = TfidfTransformer(use_idf=True)
mnb = MultinomialNB()

accuracy = {}
for item in ['text', 'title', 'headings']:
    # Refit the vectorizer, transformer, and classifier on the current column,
    # then score the refitted model on the matching test column
    X_tr_counts = cv.fit_transform(tr_data[item])
    X_tr_tfidf = tt.fit_transform(X_tr_counts)
    mnb.fit(X_tr_tfidf, tr_data['class'])
    X_te_counts = cv.transform(te_data[item])
    X_te_tfidf = tt.transform(X_te_counts)
    accuracy[item] = mnb.score(X_te_tfidf, te_data['class'])

The classification success rates are stored in a dictionary accuracy with keys 'text', 'title', and 'headings'.
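
For instance, to see which part of the page is the most informative on its own (a small usage sketch, assuming the loop above has run):

best = max(accuracy, key=accuracy.get)  # key with the highest score
print(best, accuracy[best])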

EDIT

A more elegant solution (not necessarily simpler, though) would be to use Pipeline and FeatureUnion, as pointed out by @Vivek Kumar. This approach also allows you to combine all the features into a single model and to apply weighting factors to the features extracted from the different items of your dataset.

First we import the necessary modules.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

Then we define a transformer class (as suggested in scikit-learn's "Feature Union with Heterogeneous Data Sources" example) to select the different items of your dataset:

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column (e.g. 'text') from the dataset."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; selecting a column is stateless
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
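
To make ItemSelector's role concrete, here is a quick sketch of what it does (assuming tr_data is a dict-like object such as a pandas DataFrame):

selector = ItemSelector(key='title')
titles = selector.fit_transform(tr_data)  # equivalent to tr_data['title']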

We are now ready to define the pipeline:

pipeline = Pipeline([
  ('features', FeatureUnion(
    transformer_list=[
      ('text_feats', Pipeline([
        ('text_selector', ItemSelector(key='text')),
        ('text_vectorizer', TfidfVectorizer(encoding="cp1252",
                                            stop_words="english",
                                            use_idf=True))
        ])),
      ('title_feats', Pipeline([
        ('title_selector', ItemSelector(key='title')),
        ('title_vectorizer', TfidfVectorizer(encoding="cp1252",
                                             stop_words="english",
                                             use_idf=True))
        ])),
      ('headings_feats', Pipeline([
        ('headings_selector', ItemSelector(key='headings')),
        ('headings_vectorizer', TfidfVectorizer(encoding="cp1252",
                                                stop_words="english",
                                                use_idf=True))
        ])),
    ],
    # These keys must match the names used in transformer_list
    transformer_weights={'text_feats': 0.5,  # change weights as appropriate
                         'title_feats': 0.3,
                         'headings_feats': 0.2}
    )),
  ('classifier', MultinomialNB())
])

And finally, we can classify data in a straightforward manner:

pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])
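
If you would rather learn the weights than fix them by hand, one option is to grid-search over transformer_weights (a sketch; the candidate weight sets below are made up):

from sklearn.model_selection import GridSearchCV

param_grid = {'features__transformer_weights': [
    {'text_feats': 0.5, 'title_feats': 0.3, 'headings_feats': 0.2},
    {'text_feats': 1.0, 'title_feats': 1.0, 'headings_feats': 1.0},
]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(tr_data, tr_data['class'])
print(grid.best_params_, grid.best_score_)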

– Tonechas

First, CountVectorizer and TfidfTransformer can be replaced by TfidfVectorizer (which is essentially a combination of both).

Second, the TfidfVectorizer and MultinomialNB can be combined in a Pipeline. A pipeline sequentially applies a list of transforms followed by a final estimator. When fit() is called on a Pipeline, it fits all the transforms one after the other (transforming the data at each step), then fits the final estimator on the transformed data. When score() or predict() is called, it only calls transform() on all the transformers and score() or predict() on the final estimator.

So the code will look like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252",
                                                    stop_words="english",
                                                    use_idf=True)),
                     ('nb', MultinomialNB())])

accuracy = {}
for item in ['text', 'title', 'headings']:

    # No need to save the return of fit(), it returns self
    pipeline.fit(tr_data[item], tr_data['class'])

    # Apply transforms, and score with the final estimator
    accuracy[item] = pipeline.score(te_data[item], te_data['class'])
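
After this loop, the pipeline remains fitted on the last item ('headings'). To predict classes for unseen pages from one particular column, refit on that column and call predict() (a small usage sketch):

pipeline.fit(tr_data['text'], tr_data['class'])
predicted = pipeline.predict(te_data['text'])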

EDIT: Edited to include combining all the features to get a single accuracy:

To combine the results, we can follow multiple approaches. One that is easy to understand (though it drifts back toward the cluttered side) is the following:

# Using scipy to concatenate, because TfidfVectorizer returns sparse matrices
from scipy.sparse import hstack

def get_tfidf(tr_data, te_data, columns):

    train = None
    test = None

    tfidfVectorizer = TfidfVectorizer(encoding="cp1252",
                                      stop_words="english",
                                      use_idf=True)
    for item in columns:
        # Fit on the training column, then transform the matching test
        # column with the same (just-fitted) vocabulary
        temp_train = tfidfVectorizer.fit_transform(tr_data[item])
        train = hstack((train, temp_train)) if train is not None else temp_train

        temp_test = tfidfVectorizer.transform(te_data[item])
        test = hstack((test, temp_test)) if test is not None else temp_test

    return train, test

train_tfidf, test_tfidf = get_tfidf(tr_data, te_data, ['text', 'title', 'headings']) 

nb = MultinomialNB()
nb.fit(train_tfidf, tr_data['class'])
nb.score(test_tfidf, te_data['class'])

The second (and preferable) approach is to include all of this in a single pipeline. But selecting the different columns ('text', 'title', 'headings') and concatenating the results makes it not quite so straightforward: we need to use FeatureUnion for that, as shown in the sketch below.
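
A minimal sketch of that pattern (it assumes the ItemSelector helper defined in the other answer; the weights are illustrative):

from sklearn.pipeline import FeatureUnion, Pipeline

combined = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            (item, Pipeline([
                ('select', ItemSelector(key=item)),
                ('tfidf', TfidfVectorizer(encoding="cp1252",
                                          stop_words="english",
                                          use_idf=True))
            ])) for item in ['text', 'title', 'headings']
        ],
        transformer_weights={'text': 0.5, 'title': 0.3, 'headings': 0.2})),
    ('nb', MultinomialNB())
])

combined.fit(tr_data, tr_data['class'])
combined.score(te_data, te_data['class'])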

Third, if you are open to using other libraries, DataFrameMapper from sklearn-pandas can simplify the FeatureUnion usage from the previous example.
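
For illustration, a rough sketch of that route (untested; it assumes tr_data and te_data are pandas DataFrames and that sklearn-pandas is installed):

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('text', TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('title', TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('headings', TfidfVectorizer(encoding="cp1252", stop_words="english")),
], sparse=True)

pipeline = Pipeline([('mapper', mapper), ('nb', MultinomialNB())])
pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])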

If you do want to go the second or third way, feel free to ask if you run into any difficulties.

NOTE: I have not checked the code, but it should work (barring some syntax errors, if any). I will check as soon as I'm on my PC.

– Vivek Kumar
  • Thank you for the reply and code example! This returns a dict of accuracies. I was kind of hoping there was a way to really bring together all features and get a single accuracy score for the combined use of features. Or am I looking at it the wrong way? – Bram Vanroy May 27 '17 at 07:39
  • @BramVanroy I just simplified the code you posted; in that, too, you were returning 3 accuracies separately. We can definitely combine the features (in fact, that is the correct approach; I wondered why you were not doing it when I saw your code). Have a look at [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) – Vivek Kumar May 27 '17 at 08:40
  • I thank you for that, as it is much more concise this way! However, part of the question was also how I could turn them into one feature (possibly with weights). I've checked Pipeline and FeatureUnion, but I don't see how to use it in my case... – Bram Vanroy May 27 '17 at 09:17
  • @BramVanroy Ah yes, Sorry for overlooking that. I will edit the answer to include them. – Vivek Kumar May 27 '17 at 10:50
  • No problem. In the meantime I am trying to see if I can run multiple features through the pipeline and then use GridSearch on that, but this gave me errors as well. Please see this question: https://stackoverflow.com/questions/44216021/reshape-pandas-df-to-use-in-gridsearch – Bram Vanroy May 27 '17 at 11:02
  • @BramVanroy For this you need the second solution (FeatureUnion and Pipeline). In the meanwhile, if your problem is solved for this question, please consider accepting it. – Vivek Kumar May 27 '17 at 12:58
  • Thanks Vivek. I'm going to try and figure out the Pipeline and FeatureUnion :) Thanks again! – Bram Vanroy May 27 '17 at 13:08
  • @BramVanroy To give you a start, from the example link I posted above for feature union, you need the custom ItemSelector class, which will select the column. – Vivek Kumar May 27 '17 at 13:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/145270/discussion-between-bram-vanroy-and-vivek-kumar). – Bram Vanroy May 27 '17 at 13:46