I have many .txt files like this:
But they are in a few different languages, so the user specifies the language this way:
lng = input("In what language is the text typed? ('ca' for catalan, 'es' for spanish, 'en' for english...)\n")
I would like to delete all the stopwords and save the resulting text to another .txt file. I'm using Stanza because I want to do sentiment analysis later on, but I can't figure out how to do the stopword removal with it. I also tried it with spaCy, because it's much faster, but I couldn't manage it either. This is what I have tried:
import spacy

sp = spacy.load(str(lng) + '_core_web_sm')  # the inputted language is stored in 'lng'
all_stopwords = sp.Defaults.stop_words

y = open('NODUP_FILTERED_' + filename, 'r', encoding='utf-8')
txt = y.read()

for line in range(rn):
    for word in txt:
        if word in all_stopwords:
            word = ''

print(txt)
This returns the following traceback:
OSError: [E050] Can't find model 'es_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
This happens even though I have spaCy and 'es_core_web_sm' installed.
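In case it helps clarify what I'm after, here is a minimal sketch of what I think the stopword removal should look like, under two assumptions of mine: that the Spanish small model is actually named 'es_core_news_sm' (a guess based on the spaCy models page), and that checking token.is_stop is the right way to filter. The 'NOSTOP_' output prefix is just something I made up for illustration, and 'filename' is defined earlier in my script:

import spacy

# sketch only: I am assuming the Spanish small model is called
# 'es_core_news_sm' (not 'es_core_web_sm')
sp = spacy.load('es_core_news_sm')

with open('NODUP_FILTERED_' + filename, 'r', encoding='utf-8') as y:
    txt = y.read()

doc = sp(txt)
# keep every token that spaCy does not flag as a stopword
filtered = ' '.join(tok.text for tok in doc if not tok.is_stop)

# 'NOSTOP_' is a hypothetical output prefix, just for illustration
with open('NOSTOP_' + filename, 'w', encoding='utf-8') as out:
    out.write(filtered)

Is something like this the right approach, and why does the model fail to load in the first place?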