I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or word list that covers only the essential vocabulary of the German language.
I have tried several approaches: checking nlp(word).has_vector, checking nlp(word).vector_norm == 0, and looking words up in a list such as list(nlp.vocab.strings) taken from 'de_core_news_sm' or 'de_core_news_lg'. In each case the check either accepts irrelevant words as German or fails to recognize basic German words.
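For concreteness, the checks described above amount to roughly the following sketch. The helper names (has_nonzero_vector, in_vocab_strings, demo) and the sample words are made up for illustration; only Doc.has_vector, Doc.vector_norm, nlp.vocab.strings and spacy.load are actual spaCy API. It assumes the model has been installed, e.g. with python -m spacy download de_core_news_lg.

```python
def has_nonzero_vector(nlp, word):
    """Vector-based check: True if the pipeline assigns `word` a non-zero vector."""
    doc = nlp(word)
    return bool(doc.has_vector) and doc.vector_norm > 0


def in_vocab_strings(nlp, word, _cache={}):
    """List-based check: True if `word` appears in the pipeline's string store."""
    key = id(nlp)
    if key not in _cache:
        # Build the set once per pipeline; repeated list lookups would be slow.
        _cache[key] = set(nlp.vocab.strings)
    return word in _cache[key]


def demo(model_name="de_core_news_lg"):
    """Hypothetical driver: prints both checks for a few sample words."""
    import spacy  # imported here so the helpers above stay importable without spaCy

    nlp = spacy.load(model_name)
    for word in ["Haus", "gehen", "asdfgh"]:
        print(word, has_nonzero_vector(nlp, word), in_vocab_strings(nlp, word))
```

Calling demo() prints one line per sample word with the result of each check. Neither check measures how common a word is, which may be part of why they do not isolate core vocabulary.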
I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of German and that can be used with spaCy (preferably) or other NLP packages. I would prefer a universal language package rather than a German-specific one, so that I can extend the approach to other languages just as easily.