I am trying to create a classifier to categorize websites. I am doing this for the very first time, so it's all quite new to me. Currently I am building bag-of-words features from several parts of each web page (e.g. title, text, headings) and training one classifier per part. It looks like this:
from sklearn.feature_extraction.text import CountVectorizer
countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")
X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])
from sklearn.feature_extraction.text import TfidfTransformer
transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)
X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)
from sklearn.naive_bayes import MultinomialNB
text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])
X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])
X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)
accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])
This works fine, and I get the accuracies as expected. However, as you might have guessed, this looks cluttered and is filled with duplication. My question then is, is there a way to write this more concisely?
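To make the question concrete, here is a rough sketch of the kind of shortening I have in mind, not something I have verified on my real data. It assumes that TfidfVectorizer can stand in for the CountVectorizer plus TfidfTransformer pair, and it loops over the field names instead of repeating every line three times. The tiny tr_data and te_data dicts are made-up placeholders for my real data, and the names fields, models and accuracies are my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up placeholder data; the real tr_data / te_data hold scraped pages.
tr_data = {
    "title": ["Cheap flights and hotels", "Football league results",
              "Holiday travel deals", "Tennis tournament scores"],
    "text": ["Book cheap flights and hotel rooms for your holiday trip",
             "Latest football results, league tables and match reports",
             "Compare travel deals, flights and beach resorts",
             "Live tennis scores and tournament news"],
    "headings": ["Flights Hotels Deals", "Results Tables Fixtures",
                 "Destinations Deals Resorts", "Scores Players Rankings"],
    "class": ["travel", "sports", "travel", "sports"],
}
te_data = {
    "title": ["Last minute flights", "Football match report"],
    "text": ["Find cheap hotel deals for your next trip",
             "Match results and league news"],
    "headings": ["Hotels Deals", "Results Fixtures"],
    "class": ["travel", "sports"],
}

fields = ["text", "title", "headings"]
models = {}
accuracies = {}
for field in fields:
    # One vectorizer + classifier pair per page part, as in my original code.
    model = make_pipeline(
        TfidfVectorizer(encoding="cp1252", stop_words="english"),
        MultinomialNB(),
    )
    model.fit(tr_data[field], tr_data["class"])
    models[field] = model
    accuracies[field] = model.score(te_data[field], te_data["class"])

print(accuracies)
```

This still trains three separate classifiers, so it only addresses the duplication, not the question of combining the features.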
Additionally, I am not sure how I can combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that's not allowed.
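One idea I have come across, and I am not sure it is the right approach, is to place the three tf-idf matrices side by side with scipy.sparse.hstack so that fit() receives a single matrix rather than a list. A minimal sketch under that assumption, again with made-up placeholder data and my own variable names (vectorizers, X_tr, X_te, combined_nb):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up placeholder data; the real tr_data / te_data hold scraped pages.
tr_data = {
    "title": ["Cheap flights and hotels", "Football league results",
              "Holiday travel deals", "Tennis tournament scores"],
    "text": ["Book cheap flights and hotel rooms for your holiday trip",
             "Latest football results, league tables and match reports",
             "Compare travel deals, flights and beach resorts",
             "Live tennis scores and tournament news"],
    "headings": ["Flights Hotels Deals", "Results Tables Fixtures",
                 "Destinations Deals Resorts", "Scores Players Rankings"],
    "class": ["travel", "sports", "travel", "sports"],
}
te_data = {
    "title": ["Last minute flights", "Football match report"],
    "text": ["Find cheap hotel deals for your next trip",
             "Match results and league news"],
    "headings": ["Hotels Deals", "Results Fixtures"],
    "class": ["travel", "sports"],
}

fields = ["text", "title", "headings"]
# One vectorizer per field, so each part keeps its own vocabulary.
vectorizers = {
    f: TfidfVectorizer(encoding="cp1252", stop_words="english") for f in fields
}

# Fit on the training data and glue the column blocks together.
X_tr = hstack([vectorizers[f].fit_transform(tr_data[f]) for f in fields]).tocsr()
combined_nb = MultinomialNB().fit(X_tr, tr_data["class"])

# Reuse the fitted vectorizers, in the same field order, for the test data.
X_te = hstack([vectorizers[f].transform(te_data[f]) for f in fields]).tocsr()
accuracy_combined = combined_nb.score(X_te, te_data["class"])
print(accuracy_combined)
```

The field order has to be identical for training and test data, otherwise the columns no longer line up.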
Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors have a higher importance than others.
I'm guessing I need a Pipeline, but I'm not at all sure how I should use it in this case.
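From reading the docs, my best guess at what this could look like is a Pipeline whose first step is a ColumnTransformer with one TfidfVectorizer per column, using its transformer_weights parameter for the per-feature weights. This is only a sketch under those assumptions: it presumes tr_data and te_data are pandas DataFrames, the toy rows are made up, and the weight values are arbitrary numbers I picked for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up placeholder data; the real tr_data / te_data hold scraped pages.
tr_data = pd.DataFrame({
    "title": ["Cheap flights and hotels", "Football league results",
              "Holiday travel deals", "Tennis tournament scores"],
    "text": ["Book cheap flights and hotel rooms for your holiday trip",
             "Latest football results, league tables and match reports",
             "Compare travel deals, flights and beach resorts",
             "Live tennis scores and tournament news"],
    "headings": ["Flights Hotels Deals", "Results Tables Fixtures",
                 "Destinations Deals Resorts", "Scores Players Rankings"],
    "class": ["travel", "sports", "travel", "sports"],
})
te_data = pd.DataFrame({
    "title": ["Last minute flights", "Football match report"],
    "text": ["Find cheap hotel deals for your next trip",
             "Match results and league news"],
    "headings": ["Hotels Deals", "Results Fixtures"],
    "class": ["travel", "sports"],
})

# A plain string as the column selector hands each vectorizer a 1-D column
# of raw text; columns not listed (here "class") are dropped by default.
features = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(stop_words="english"), "text"),
        ("title", TfidfVectorizer(stop_words="english"), "title"),
        ("headings", TfidfVectorizer(stop_words="english"), "headings"),
    ],
    # Arbitrary illustrative weights: each block is multiplied by its weight.
    transformer_weights={"text": 1.0, "title": 2.0, "headings": 1.5},
)

clf = Pipeline([("features", features), ("nb", MultinomialNB())])
clf.fit(tr_data, tr_data["class"])
accuracy = clf.score(te_data, te_data["class"])
print(accuracy)
```

Whether multiplying tf-idf blocks by a constant is a sensible way to express importance for MultinomialNB is part of what I am unsure about.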