I have a huge dictionary/dataframe of German words and how often each of them appeared in a large text corpus. For example:
der 23245
die 23599
das 23959
eine 22000
dass 18095
Buch 15988
Büchern 1000
Arbeitsplatz-Management 949
Arbeitsplatz-Versicherung 800
Since words like "Buch" (book) and "Büchern" (books, in a declined form) have similar meanings, I want to add up their frequencies. The same goes for the articles "der", "die" and "das", but not for the last two words, which have completely different meanings even though they share the same stem.
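To make the goal concrete, here is a minimal sketch of the aggregation I am after, assuming I already had some word-to-group mapping (the lemma_of dict below is hypothetical and written by hand just for these few words; producing such a mapping automatically is the actual problem):
from collections import defaultdict

freqs = {
    "der": 23245, "die": 23599, "das": 23959,
    "eine": 22000, "dass": 18095,
    "Buch": 15988, "Büchern": 1000,
    "Arbeitsplatz-Management": 949,
    "Arbeitsplatz-Versicherung": 800,
}

# hypothetical word -> group mapping, hand-written for this example only
lemma_of = {
    "der": "der", "die": "der", "das": "der",
    "eine": "eine", "dass": "dass",
    "Buch": "Buch", "Büchern": "Buch",
    "Arbeitsplatz-Management": "Arbeitsplatz-Management",
    "Arbeitsplatz-Versicherung": "Arbeitsplatz-Versicherung",
}

grouped = defaultdict(int)
for word, count in freqs.items():
    grouped[lemma_of[word]] += count

print(grouped["Buch"])  # 16988 (15988 + 1000)
print(grouped["der"])   # 70803 (23245 + 23599 + 23959)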
I tried the Levenshtein distance, which is "the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other." But I get a bigger Levenshtein distance between "Buch" and "Büchern" than between "das" and "dass", even though the latter two have completely different meanings:
import enchant
string1 = "das"
string2 = "dass"
string3 = "Buch"
string4 = "Büchern"
# edit distances via pyenchant's levenshtein helper
print(enchant.utils.levenshtein(string1, string2))
print(enchant.utils.levenshtein(string3, string4))
>>>> 1
>>>> 4
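One direction I have been considering instead of raw edit distance is lemmatization, for example with spaCy's German pipeline. The sketch below is only an idea, assuming the de_core_news_sm model is installed; I have not checked how reliable its lemmas are for isolated words without sentence context:
import spacy
from collections import defaultdict

nlp = spacy.load("de_core_news_sm")  # assumes this German model is installed

freqs = {"das": 23959, "dass": 18095, "Buch": 15988, "Büchern": 1000}

grouped = defaultdict(int)
for word, count in freqs.items():
    lemma = nlp(word)[0].lemma_  # lemma of the single-token "document"
    grouped[lemma] += count

print(dict(grouped))
But running an NLP pipeline over every distinct word in the corpus feels heavy for data this size.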
Is there any other way to cluster such words efficiently?