I'm doing text classification using Python and scikit-learn.
Currently I use TfidfVectorizer as the vectorizer (to transform raw text into feature vectors) and MultinomialNB as the classifier. I set the parameter ngram_range=(1, 2) (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html ), i.e. I use both single words and bigrams.
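For reference, a minimal sketch of this kind of setup. The toy texts and labels are invented for illustration, and the use of `make_pipeline` is an assumption; my real code may wire the two steps together differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
train_texts = [
    "not good at all",
    "very good movie",
    "not bad really",
    "very bad movie",
]
train_labels = ["neg", "pos", "pos", "neg"]

# ngram_range=(1, 2) extracts both single words and bigrams as features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# The fitted vocabulary mixes unigrams ("good") and bigrams ("not good").
vectorizer = model.named_steps["tfidfvectorizer"]
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```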
After training the classifier and evaluating it on the test set and the CV set, I'd like to improve its accuracy. I looked at the most informative features (following the question How to get most informative features for scikit-learn classifiers?). I see that the most informative features include single words (n-gram size 1) that have no real impact on classification on their own, but that would have a great impact as part of bigrams (word collocations).
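A rough sketch of one way to rank features for a two-class MultinomialNB, by comparing the per-class log probabilities in `feature_log_prob_`. This is not necessarily the method from the linked question, and the toy data is invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
texts = [
    "not good at all",
    "very good movie",
    "not bad really",
    "very bad movie",
]
labels = ["neg", "pos", "pos", "neg"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# Feature names ordered by column index in X.
names = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# log P(feature | classes_[1]) - log P(feature | classes_[0]);
# large positive values lean towards classes_[1], large negative towards classes_[0].
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
order = np.argsort(log_ratio)

print(clf.classes_[0], list(names[order[:5]]))
print(clf.classes_[1], list(names[order[-5:]]))
```

In a listing like this, unigrams and bigrams appear side by side, which is where I notice the single words I would like to drop.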
So I can't use stop_words, because TfidfVectorizer would then also leave these words out of the collocations, and I can't use preprocessor for the same reason. Question: how can I exclude certain words as standalone features in TfidfVectorizer, but keep them in the collocations they are part of?
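A small sketch of the problem with stop_words, using two invented sentences and "not" as a hypothetical word to exclude. The stop word is removed before the bigrams are built, so the collocation disappears together with the single word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
docs = ["this is not good", "this is good"]

# Without stop words: "not" and "not good" are both features.
plain = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# With "not" as a stop word: the token is dropped before n-grams are formed,
# so the bigram "not good" is lost as well.
filtered = TfidfVectorizer(ngram_range=(1, 2), stop_words=["not"]).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

What I want instead is a vocabulary without the unigram "not" but still containing "not good".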