Python NLTK: search for occurrence of a word

Question

I use the Brown corpus, "brown.words()", which gives me a list of 1161192 words.

Now I want to find every occurrence of the word "have", so whenever the corpus contains "has", "had", "haven't", etc., I want to do something with it (push it into an array, increment a counter, or something else).

Edit: Note that this question is about finding matching words. If I search for "have", I want a way to match it to "haven't" or "had", so .count() would not solve this problem because it doesn't help with matching anything.

Example code I would use if stemming/lemmatization worked:

from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer

def findWordFamily(findWord):
    wordFamily = []

    lmtzr = WordNetLemmatizer()

    # Compare every corpus word to the search word by lemma
    findWord = lmtzr.lemmatize(findWord)
    for word in brown.words():
        lemma = lmtzr.lemmatize(word)
        if lemma == findWord:
            wordFamily.append(word)

    return wordFamily

print(findWordFamily("have"))
# Desired output: ["have", "have", "had", "having", "haven't", "having"]

But the problem is that:

for word in brown.words():
    lemma = lmtzr.lemmatize(word)
    # if word is "having", lemma is also "having" instead of "have"

Possible duplicate of [nltk function to count occurrences of certain words](https://stackoverflow.com/questions/22762893/nltk-function-to-count-occurrences-of-certain-words) — iam.Carrot, Mar 01 '18 at 20:58
did you even bother to read the question? .count() is useless because I don't want to count it, I want a way of matching it — Michael Baumgarn, Mar 02 '18 at 13:29

score 1 · Accepted Answer · answered Mar 02 '18 at 16:00

Before trying to match the words, you might want to do a little pre-processing, so that "has" or "haven't" ends up transformed to "have".

I recommend you take a look at both stemming and lemmatizing:

NLTK's Wordnet Lemmatizer (one of my favorites): http://www.nltk.org/_modules/nltk/stem/wordnet.html

NLTK's stemmers: http://www.nltk.org/howto/stem.html

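Note: for the lemmatizer to work well with verbs, you have to specify that they are in fact verbs by passing the part of speech as the second argument. By default it treats every word as a noun, which is why "having" comes back unchanged: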

nltk.stem.WordNetLemmatizer().lemmatize('having', 'v')
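
Putting that together with the loop from the question, a minimal sketch might look like the following. It assumes the "wordnet" and "brown" data have already been fetched with nltk.download, lemmatizes every token as a verb, and lowercases before comparing. The names find_word_family and make_verb_lemmatizer are made up for illustration, and stripping a trailing "n't" is my own assumption for handling contractions such as "haven't", which WordNet is unlikely to reduce on its own.

```python
from functools import lru_cache
from typing import Callable, Iterable, List


def find_word_family(
    target: str,
    tokens: Iterable[str],
    lemmatize: Callable[[str], str],
) -> List[str]:
    """Return every token whose lemma equals the lemma of `target`."""
    target_lemma = lemmatize(target.lower())
    return [tok for tok in tokens if lemmatize(tok.lower()) == target_lemma]


def make_verb_lemmatizer() -> Callable[[str], str]:
    """Hypothetical helper: a cached, verb-only WordNet lemmatizer."""
    # Imported here so find_word_family stays usable without NLTK installed.
    from nltk.stem import WordNetLemmatizer  # needs nltk.download("wordnet")

    lmtzr = WordNetLemmatizer()

    @lru_cache(maxsize=None)  # the corpus repeats words a lot, so cache lemmas
    def lemmatize(word: str) -> str:
        # Assumption: drop "n't" so "haven't" is looked up as "have".
        if word.endswith("n't"):
            word = word[:-3]
        return lmtzr.lemmatize(word, pos="v")

    return lemmatize
```

With the corpus loaded via "from nltk.corpus import brown", calling find_word_family("have", brown.words(), make_verb_lemmatizer()) should return the matching tokens in corpus order. Treating every token as a verb is crude; if that produces false matches, the tags from brown.tagged_words() could be used to lemmatize only actual verbs.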

Hope this helps!
