
I have a process which is something like:

  • Build a word token pattern - naively, 2+ alphanumerics surrounded by word boundaries.
  • Tokenize a document, then lemmatize these tokens with nltk.
  • Add some "custom" stop words to sklearn's built-in English stop words. (Here, using just one company name in a reproducible example.)
  • Get term frequencies utilizing the above, with unigrams through 4-grams.

The issue is that multi-word stop words (phrases) aren't being dropped, presumably because tokenization happens first and stop words are checked one token at a time?

Full example:

import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as ESW, CountVectorizer

# Make sure we have the corpora used by nltk's lemmatizer
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# "Naive" token similar to that used by sklearn
TOKEN = re.compile(r'\b\w{2,}\b')

# Tokenize, then lemmatize these tokens
# Modified from:
# http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return (self.wnl.lemmatize(t) for t in TOKEN.findall(doc))

# Add 1 more phrase to sklearn's stop word list
sw = ESW.union(frozenset(['sinclair broadcast group']))

vect = CountVectorizer(stop_words=sw, ngram_range=(1, 4),
                       tokenizer=LemmaTokenizer())

# These documents are just nonsense babbling
docs = ["""And you ask Why You Are Sinclair Broadcast Group is Asking It""",
        """Why are you asking what Sinclair Broadcast Group and you"""]

tf = vect.fit_transform(docs)

To reiterate: the single-word stopwords have been removed properly, but the phrase remains:

vect.get_feature_names()

# ['ask',
#  'ask sinclair',
#  'ask sinclair broadcast',
#  'ask sinclair broadcast group',
#  'asking',
#  'asking sinclair',
#  'asking sinclair broadcast',
#  'asking sinclair broadcast group',
#  'broadcast',
#  'broadcast group',
#  'broadcast group asking',
#  'group',
#  'group asking',
#  'sinclair',
#  'sinclair broadcast',
#  'sinclair broadcast group',
#  'sinclair broadcast group asking']

How can I correct this?

Brad Solomon

2 Answers


From the documentation of CountVectorizer:

stop_words : string {‘english’}, list, or None (default)

If ‘english’, a built-in stop word list for English is used.

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

And further down for the parameter token_pattern:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

So 'sinclair broadcast group' would only be removed if the analyzer produced it as a single token. But the default analyzer is 'word', meaning stop word detection applies only to individual word tokens, which are defined by the default token_pattern (or your custom tokenizer) as described above.

Tokens are not n-grams (rather, n-grams are made out of tokens, and stop-word removal appears to occur at the token level, prior to construction of n-grams).
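
You can see this ordering by calling the analyzer that the vectorizer builds (a quick sketch, assuming the vect and docs from your question are in scope; exact output may differ by sklearn version):

# build_analyzer() wires together preprocessing, your tokenizer, single-word
# stop word removal, and only then the joining of surviving tokens into n-grams
analyze = vect.build_analyzer()
print(analyze(docs[1]))
# e.g. ['asking', 'sinclair', 'broadcast', 'group', 'asking sinclair', ...]
# 'sinclair broadcast group' first comes into existence here, after the
# stop word check has already run on single tokens.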

As a quick check, you could change your custom stop word to just 'sinclair' for the experiment, to see that it is correctly removed when treated as an isolated word.
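
For example (untested sketch, reusing ESW, LemmaTokenizer, and docs from your question):

# With a single-word custom stop word, removal works as expected
sw_single = ESW.union(frozenset(['sinclair']))
vect_single = CountVectorizer(stop_words=sw_single, ngram_range=(1, 4),
                              tokenizer=LemmaTokenizer())
vect_single.fit_transform(docs)
print(vect_single.get_feature_names())
# 'sinclair' is gone, and so is every n-gram that would have contained it,
# because the word is dropped before the n-grams are built.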

In other words, you'll need to pass your own callable as analyzer to apply stop word logic to n-grams as well, and you'll have to check for the phrase manually. The default behavior assumes stop word detection applies only to single words, never to n-grams.

Below is an example of a custom analyzer function for your case. It is based on this answer ... note that I didn't test it, so there might be bugs.

def trigram_match(i, trigram, words):
    """Return True if words[i] falls inside an occurrence of `trigram`."""
    if i < len(words) - 2 and words[i:i + 3] == trigram:
        return True
    if (i > 0 and i < len(words) - 1) and words[i - 1:i + 2] == trigram:
        return True
    if i > 1 and words[i - 2:i + 1] == trigram:
        return True
    return False


def custom_analyzer(text):
    # Uses the stop word set `sw` and the `re` import from your question
    bad_trigram = ['sinclair', 'broadcast', 'group']
    words = [w.lower() for w in re.findall(r'\w{2,}', text)]
    for i, w in enumerate(words):
        if w in sw or trigram_match(i, bad_trigram, words):
            continue
        yield w
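
One caveat when wiring this in: when analyzer is a callable, CountVectorizer uses its output directly as the features and ignores ngram_range, so this version produces filtered unigrams only; if you still want n-grams in the vocabulary, you'd have to build them inside the analyzer yourself. Untested sketch:

vect = CountVectorizer(analyzer=custom_analyzer)
tf = vect.fit_transform(docs)
print(vect.get_feature_names())
# filtered single words only; no 'sinclair', 'broadcast', or 'group'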
    
ely
    [Here's an example](https://stackoverflow.com/a/21600406/567620) of a custom analyzer. In your case you would want to preprocess the `words` data structure before entering the loop and yielding values. Your preprocessing would check individual stopwords from the basic stopword set, and then check for 'sinclair' and check if the following two words also match, and then remove them all if so. – ely Feb 27 '18 at 18:38
  • Actually, you can probably do this inside the loop, which will be efficient. Just use `enumerate` so you can check the forward indices, and do not yield a word if it is a stopword *or* if it forms one of the pieces of your trigram stopword, based on index checking. – ely Feb 27 '18 at 18:39
  • @BradSolomon I updated the answer with an example of this that would be close to functional (minus any bugs from not testing) for your situation. – ely Feb 27 '18 at 18:57
  • Thanks for putting in time on this, but unfortunately it's not getting me to where I need to be. I posted a working solution here – Brad Solomon Feb 27 '18 at 22:39

Here's a custom analyzer that works for me. It's a bit hacky, but effectively does all of the text processing in one step and is fairly fast:

from functools import partial
from itertools import islice
import re

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS


def window(seq, n=3):
    """Sliding window of width n over seq (standard itertools recipe)."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result


class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc, stop_words):
        return tuple(self.wnl.lemmatize(i.lower()) for i in
                     re.findall(r'\b\w{3,}\b', doc)
                     if i.lower() not in stop_words)


def analyzer(doc, stop_words=None, stop_phr=None, ngram_range=(1, 4)):
    if not stop_words:
        stop_words = set()
    if not stop_phr:
        stop_phr = set()
    start, stop = ngram_range
    lt = LemmaTokenizer()
    # Tokens come back already lowercased, lemmatized, and stop-word-filtered
    words = lt(doc, stop_words=stop_words)
    # Yield the multi-word n-grams that aren't stop phrases ...
    for n in range(start + 1, stop + 1):
        for ngram in window(words, n=n):
            res = ' '.join(ngram)
            if res not in stop_phr:
                yield res
    # ... then yield the unigrams separately so they aren't counted twice
    for w in words:
        yield w


analyzer_ = partial(analyzer, stop_words=ENGLISH_STOP_WORDS,
                    stop_phr={'sinclair broadcast group'})
vect = CountVectorizer(analyzer=analyzer_)
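
Fitting this on the docs from the question confirms the phrase is filtered out (quick check, assuming docs from above is still in scope; shorter n-grams like 'sinclair broadcast' are intentionally kept):

tf = vect.fit_transform(docs)
print(vect.get_feature_names())
# 'sinclair broadcast group' no longer appears in the vocabulary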
Brad Solomon