
I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or word list that covers only the essential vocabulary of the German language.

I have tried different approaches, such as the spaCy checks nlp(word).has_vector and nlp(word).vector_norm == 0, and word lists like list(nlp.vocab.strings) from 'de_core_news_sm' or 'de_core_news_lg', but they either accept irrelevant words as German or fail to recognize basic German words. I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of German and can be used with (preferably) spaCy or another NLP package. I would prefer a universal, not German-specific, approach, so that I can extend it to other languages just as easily.
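
Roughly, the checks I tried look like this (just an illustrative sketch, not my full code):

import spacy

nlp = spacy.load("de_core_news_lg")

word = "Haus"
doc = nlp(word)
print(doc.has_vector)             # also True for plenty of strings that are not core German
print(doc.vector_norm == 0)       # only catches fully out-of-vocabulary strings
print(word in nlp.vocab.strings)  # the StringStore is not a curated German word list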

Levin

1 Answer


One option is a frequency-based approach. For this, you need a frequency list that ranks words by how often they occur in written or spoken German. Here is an example repo. Alternatively, you can build such a list yourself from a large corpus.

I can show a very basic version using spaCy:

  • Define a function to filter out non-core German words. The function checks whether a token's stem appears in the frequency list.
  • Process your text and apply the function to each token of the resulting doc.
import spacy
import pandas as pd
import nltk

nlp = spacy.load("de_core_news_sm")
stemmer = nltk.stem.Cistem()

# Load a frequency list of German words
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])

# Define a function to filter out non-core German words
def is_core_german_word(token):
    # Stem the token the same way the frequency list was built (CISTEM) and
    # treat stems missing from the list as non-core instead of raising a KeyError.
    stem = stemmer.stem(token.text.lower())
    return stem in df.index and df.at[stem, 'freq'] > 0

# Process your text
text = "Lass uns ein bisschen Spaß haben!"
doc = nlp(text)

# Filter out non-core German words
core_german_words = [token.text for token in doc if is_core_german_word(token)]

print(core_german_words)

Note that the quality of the results will depend on the quality and coverage of the frequency list you use. You may need to combine multiple approaches, such as using the CEFR levels or word embeddings, to obtain a word list that accurately covers only the core vocabulary of the German language.
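
For example, here is a minimal sketch that combines the frequency threshold with spaCy's vector check, reusing df, stemmer and doc from above. It assumes a model that ships with word vectors (e.g. de_core_news_md or de_core_news_lg rather than the sm model), and FREQ_THRESHOLD is only an illustrative value:

FREQ_THRESHOLD = 100  # illustrative cut-off only; tune it on your own data

def is_core_german_word_combined(token):
    # Require both a minimum corpus frequency and a known word vector.
    stem = stemmer.stem(token.text.lower())
    frequent_enough = stem in df.index and df.at[stem, 'freq'] >= FREQ_THRESHOLD
    return frequent_enough and token.has_vector

core_german_words = [token.text for token in doc if is_core_german_word_combined(token)]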

I am aware that this is very language-specific, but I thought it might be helpful if no other answer comes up.

  • Hey, thanks for sharing! Using the frequency sounds interesting, since competence terms will have a low to zero frequency. So even if a non-core word is listed in the frequency list, I could detect it by its low frequency. So wouldn't it make sense to choose a frequency threshold larger than zero? Otherwise we treat the frequency list like a general word list, don't we? Thanks also for sharing the list! It seems to have 42 million stemmed words (?). That is definitely large enough, and the frequency should handle it. I'll give it a try and update here. – Levin Mar 09 '23 at 21:05
  • Yes, you're right. Sometimes, a word may be part of the core vocabulary but have a low frequency in the frequency list. In such cases, using a frequency threshold higher than zero can help filter out rare or specialized words. I think you could choose a threshold based on the distribution of frequencies in the list. Words that occur more frequently than the average frequency could be considered part of the core vocabulary, while words that occur less frequently could be considered non-core. But again, you might want to research more on that. – Bedir Tapkan Mar 14 '23 at 00:54
  • I finally tested it, and it works like a charm! It takes a while to load and to look up, but I put it into a dictionary and then it works fine. Thank you! – Levin Apr 24 '23 at 06:47
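
Following up on the comments above, a minimal sketch of deriving a threshold from the frequency distribution and moving the lookups into a plain dict for speed (the median cut-off is only an illustrative choice):

freq_by_stem = df['freq'].to_dict()  # one-time conversion; dict lookups are much faster than df.at
threshold = df['freq'].median()      # or any other cut-off derived from the distribution

def is_core_german_word_fast(token):
    # Stems missing from the list count as frequency 0 and are treated as non-core.
    return freq_by_stem.get(stemmer.stem(token.text.lower()), 0) > threshold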