I have a process which is something like:
- Build a word token pattern - naively, 2+ alphanumerics surrounded by word boundaries.
- Tokenize a document, then lemmatize these tokens with nltk.
- Add some "custom" stop words to sklearn's built-in English stop words. (Here, using just one company name in a reproducible example.)
- Get term frequencies utilizing the above, with unigrams through 4-grams.
The issue is that multi-word stop words (phrases) aren't being dropped, presumably because stop-word filtering is applied to individual tokens before the n-grams are built?
Full example:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as ESW, CountVectorizer
# Make sure we have the corpora used by nltk's lemmatizer
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
# "Naive" token similar to that used by sklearn
TOKEN = re.compile(r'\b\w{2,}\b')
# Tokenize, then lemmatize these tokens
# Modified from:
# http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return (self.wnl.lemmatize(t) for t in TOKEN.findall(doc))
# Add 1 more phrase to sklearn's stop word list
sw = ESW.union(frozenset(['sinclair broadcast group']))
vect = CountVectorizer(stop_words=sw, ngram_range=(1, 4),
                       tokenizer=LemmaTokenizer())
# The docs themselves are nonsense babble
docs = ["""And you ask Why You Are Sinclair Broadcast Group is Asking It""",
"""Why are you asking what Sinclair Broadcast Group and you"""]
tf = vect.fit_transform(docs)
To reiterate: the single-word stop words have been removed properly, but the phrase remains:
vect.get_feature_names()
# ['ask',
# 'ask sinclair',
# 'ask sinclair broadcast',
# 'ask sinclair broadcast group',
# 'asking',
# 'asking sinclair',
# 'asking sinclair broadcast',
# 'asking sinclair broadcast group',
# 'broadcast',
# 'broadcast group',
# 'broadcast group asking',
# 'group',
# 'group asking',
# 'sinclair',
# 'sinclair broadcast',
# 'sinclair broadcast group',
# 'sinclair broadcast group asking']
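If I'm reading the pipeline right, CountVectorizer compares each single token against the stop list and only afterwards joins tokens into n-grams, so a multi-word entry can never match anything. A minimal check that seems consistent with this (using only the default tokenizer, no lemmatizing, and a stop list containing just the phrase):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stop list contains ONLY the multi-word phrase
vect = CountVectorizer(stop_words=['sinclair broadcast group'],
                       ngram_range=(1, 2))
analyze = vect.build_analyzer()

# No individual token equals the phrase, so nothing is filtered,
# and the bigrams are built from the surviving tokens afterwards
print(analyze("sinclair broadcast group"))
# ['sinclair', 'broadcast', 'group', 'sinclair broadcast', 'broadcast group']
```

So the phrase entry in `sw` is effectively ignored, which matches the feature names above.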
How can I correct this?