
Could anyone point me to a solution or library that, instead of lemmatising, does inflection(?), and does so for multiple languages (English, Dutch, German and French)?

To give an example: I have the lemma 'science', for which I need the words 'sciences', 'scientific', 'scientifically'... returned. So plurals and adjectives.

I looked into NLTK (cf. WordNet) and spaCy, but did not find a solution.

dderom
  • I think the plural of "lemma" is "lemmata" – Stef Feb 26 '23 at 10:22
  • If you have a function `lemmatise` and a long list of all words in your language (for instance, the official Scrabble dictionary) then you can group the words by lemma in a python dict: `groups = {}; for word in list_of_words: groups.setdefault(lemmatise(word), []).append(word)` and now all words that have lemma `'science'` will be grouped in `groups['science']`. – Stef Feb 26 '23 at 10:24
  • Similar questions: [gerund form of a word in python?](https://stackoverflow.com/questions/64977817/gerund-form-of-a-word-in-python); [How to get inflections for a word?](https://stackoverflow.com/questions/9653815/how-to-get-inflections-for-a-word-using-wordnet) – Stef Feb 26 '23 at 11:02
  • Check out: https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/ – alvas Feb 28 '23 at 17:48

1 Answer


You can invert a lemmatise function by applying it to every word in a word list, such as the Scrabble dictionary, and grouping words with a common stem in a Python dict.

Of course, the groups will depend strongly on the lemmatise function you use. Below, I use `nltk.stem.WordNetLemmatizer.lemmatize`, which correctly groups 'science' and 'sciences' under the same stem 'science', but doesn't group 'scientific' with them.

So you'll need a more "brutal" lemmatise function that maps more words to the same stem.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # needed once if the WordNet corpus is not installed

wnl = WordNetLemmatizer()
d = {}  # maps each lemma to the list of dictionary words that share it
with open('scrabble_dict.txt', 'r') as f:
    next(f); next(f)  # skip the two header lines
    for word in f:
        word = word.strip().lower()
        d.setdefault(wnl.lemmatize(word), []).append(word)

print(d['science'])
# ['science', 'sciences']

print(d['scientific'])
# ['scientific']

print([stem for stem in d if stem.startswith('scien')])
# ['science', 'scienced', 'scient', 'scienter', 'sciential', 'scientific', 'scientifical', 'scientifically', 'scientificities', 'scientificity', 'scientise', 'scientised', 'scientises', 'scientising', 'scientism', 'scientisms', 'scientist', 'scientistic', 'scientize', 'scientized', 'scientizes', 'scientizing']

print(d['lemma'])
# ['lemma', 'lemmas', 'lemmata']
```
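
To illustrate what a more "brutal" grouping function could look like, here is a minimal sketch that uses the same grouping idea with a deliberately crude key: the first five letters of each word. The names `crude_stem` and `group_by_key` and the short word list are hypothetical and only for illustration; this is not a real stemmer, just a stand-in that shows the trade-off.

```python
def crude_stem(word, length=5):
    """Hypothetical 'brutal' stand-in for a stemmer: keep the first `length` letters."""
    return word[:length]


def group_by_key(words, key):
    """Group words under key(word), the same way the lemmatiser was inverted above."""
    groups = {}
    for word in words:
        word = word.strip().lower()
        groups.setdefault(key(word), []).append(word)
    return groups


# Small illustrative word list (an assumption, standing in for a full dictionary file).
words = ["science", "sciences", "scientific", "scientifically",
         "scienter", "scissors", "lemma", "lemmata"]

groups = group_by_key(words, crude_stem)
print(groups[crude_stem("science")])
# ['science', 'sciences', 'scientific', 'scientifically', 'scienter']
```

The group now contains 'scientific' and 'scientifically', but also the unrelated 'scienter', so a more aggressive key trades missed words for false matches. Any function from word to string can be passed as `key`. For example, NLTK's `SnowballStemmer` is available for English, Dutch, German and French, but whether a particular stemmer actually maps 'science' and 'scientific' to the same stem would need to be checked.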

Stef