
I have a huge dictionary/dataframe of German words and how often each appeared in a large text corpus. For example:

der                                23245
die                                23599
das                                23959
eine                               22000
dass                               18095
Buch                               15988
Büchern                             1000
Arbeitsplatz-Management              949
Arbeitsplatz-Versicherung            800

Since words like "Buch" (book) and "Büchern" (books, in a different declension form) have similar meanings, I want to add up their frequencies. The same goes for the articles "der, die, das", but not for the last two words, which have completely different meanings even though they share a common stem.

I tried the Levenshtein distance, which is "the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other." But I get a bigger Levenshtein distance between "Buch" and "Büchern" than between "das" and "dass", which have completely different meanings:

import enchant

string1 = "das"
string2 = "dass"
string3 = "Buch"
string4 = "Büchern"

# enchant.utils.levenshtein gives the number of single-character edits between two strings
print(enchant.utils.levenshtein(string1, string2))
print(enchant.utils.levenshtein(string3, string4))
>>>> 1
>>>> 4

Is there any other way to cluster such words efficiently?

    You could try converting the words to embeddings, and measure their cosine distance. A pair of words with a short cosine distance between them should be closer in meaning. See https://www.deepset.ai/german-word-embeddings – Eric L Sep 13 '21 at 07:44
  • Thank you for the suggestion! Whereas the cosine distance makes sense to me, I am not really sure I understand the embedding conversion part. I already have the words with (most of) their different grammatical forms. – johnnydoe Sep 13 '21 at 07:50
  • Embeddings are vectors. Each word in your dictionary corresponds to a vector. You can download a pre-trained embeddings model (it's basically a word-vector lookup table) from the link in the first comment. – Eric L Sep 13 '21 at 07:55
  • @EricL Thank you, I'll look into it. For now I only tried something similar to the solution given here https://stackoverflow.com/questions/29484529/cosine-similarity-between-two-words-in-a-list, but the results are as bad as with the Levenshtein distance. – johnnydoe Sep 13 '21 at 08:02
  • You could stem the words first, using e.g. a Snowball stemmer for German; see http://snowball.tartarus.org/algorithms/german/stemmer.html. It basically reduces a word to its stem (see the sketch after these comments). If you need more meaning-based clustering (e.g. "Job" being similar to "Work"), you should use embeddings, as already suggested. There are great videos on YouTube explaining how they work. – Dennis Sep 13 '21 at 08:10
  • @Dennis thank you! I don't really need to handle synonyms, but German is so full of compound words and different grammatical forms that it's quite challenging to see how one should group them, as in the example in my post. – johnnydoe Sep 13 '21 at 08:52
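
Following the stemming suggestion above, here is a minimal sketch using NLTK's implementation of the German Snowball stemmer (NLTK is an assumption here; the comment only links the algorithm). Stemming should collapse "Buch"/"Büchern" into one key while keeping "das" and "dass" apart, but unlike lemmatization it does not merge the articles der/die/das:

from nltk.stem.snowball import SnowballStemmer

# the German Snowball stemmer lowercases a word and strips inflectional suffixes
stemmer = SnowballStemmer("german")

words = ['der', 'die', 'das', 'eine', 'dass', 'Buch', 'Büchern',
         'Arbeitsplatz-Management', 'Arbeitsplatz-Versicherung']

for w in words:
    print(w, '->', stemmer.stem(w))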

1 Answer

First, Buch and Büchern are pretty simple, as they are just different morphological forms of the same word: for both there is only one entry in the dictionary, called a lemma. As it happens, der, die and das are also just different forms of the lemma der. So we only need to count the dictionary forms of the words (the lemmas). spaCy gives easy access to the lemma of a word, for example:

import spacy
from collections import Counter

nlp = spacy.load('de')  # load the German pipeline
words = ['der', 'die', 'das', 'eine', 'dass', 'Buch', 'Büchern', 'Arbeitsplatz-Management', 'Arbeitsplatz-Versicherung']
lemmas = [nlp(a)[0].lemma_ for a in words]  # lemma of the single token in each entry
counter = Counter(lemmas)

This results in the following counter:

Counter({'der': 3, 'einen': 1, 'dass': 1, 'Buch': 2, 'Arbeitsplatz-Management': 1, 'Arbeitsplatz-Versicherung': 1})
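
Note that this counts how many surface forms map to each lemma. If the goal is to sum up the original corpus frequencies instead, here is a minimal sketch along the same lines (the freqs dict below simply restates the numbers from the question):

import spacy
from collections import defaultdict

nlp = spacy.load('de')  # with spaCy v3, use 'de_core_news_sm' (see the comments below)

freqs = {'der': 23245, 'die': 23599, 'das': 23959, 'eine': 22000, 'dass': 18095,
         'Buch': 15988, 'Büchern': 1000,
         'Arbeitsplatz-Management': 949, 'Arbeitsplatz-Versicherung': 800}

lemma_freqs = defaultdict(int)
for word, count in freqs.items():
    lemma = nlp(word)[0].lemma_    # dictionary form of the word
    lemma_freqs[lemma] += count    # accumulate the word's frequency under its lemma

print(dict(lemma_freqs))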

  • Thank you, but I did try spacy at the very beginning and spacy.load('de') gives me the following error: "OSError: [E941] Can't find model 'de'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead: nlp = spacy.load("de_core_news_sm")". Therefore, I tried the one they suggest and it can't find the model. Do you know how could I fix this, I have been struggling with it from the start and that's why I decided to try something else. – johnnydoe Sep 13 '21 at 13:08
  • You probably need to download the model first: ```python -m spacy download de``` – chefhose Sep 13 '21 at 13:14
  • Indeed, and then use nlp = spacy.load("de_core_news_sm"); a combined snippet follows below. – johnnydoe Sep 13 '21 at 13:35
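
Putting those two comments together, a quick sketch of the fix for spaCy v3 (the model name comes from the error message quoted above):

# run once in a shell to download the German model
# python -m spacy download de_core_news_sm

import spacy
nlp = spacy.load('de_core_news_sm')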