
I have a large dataset of tweets in different languages. I want to apply a preprocessing function only to the sentences that are in German.

[Image: dataframe head]

import time
import re
import sys
import nltk
nltk.download('stopwords')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()



def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    sentence=sentence.replace('{html}',"") 
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)  
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('german')]
    return " ".join(filtered_words)

#run previous function

test['cleanText']=test.apply(lambda s:preprocess(s['text']) if s['lang'] == "de" else None) 

When I run the code, I get the following error:

KeyError: 'lang'

Davide Fiocco
Daniel AG
  • @yes, thank you very much! – Daniel AG Feb 04 '23 at 17:57
  • In your own words, where the code says `s['lang']`, why should it be possible to do that? What do you think will be the value of `s` when the `lambda` is used, and why should it have a `'lang'` key? – Karl Knechtel Feb 04 '23 at 17:57
  • @KarlKnechtel well, now it works. I forgot that apply() works column-wise by default. Passing axis=1 applies the function to each row instead. – Daniel AG Feb 04 '23 at 18:04
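
For anyone landing here with the same KeyError, the fix from the comments can be sketched as follows. With the default axis=0, `DataFrame.apply` hands each column to the lambda as a Series indexed by row labels, so `s['lang']` looks for a row labelled 'lang' and fails. The tiny `test` frame and the lowercasing-only `preprocess` below are hypothetical stand-ins for the real data and function; only the column names `text` and `lang` come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `test` frame; only the column
# names `text` and `lang` are taken from the question.
test = pd.DataFrame({
    "text": ["Das ist ein Test", "This is a test"],
    "lang": ["de", "en"],
})

def preprocess(sentence):
    # Placeholder for the question's preprocess(); it only lowercases here.
    return str(sentence).lower()

# axis=1 passes each row to the lambda, so s["lang"] is a column lookup.
test["cleanText"] = test.apply(
    lambda s: preprocess(s["text"]) if s["lang"] == "de" else None,
    axis=1,
)

# Possible alternative: select the German rows with a boolean mask and call
# preprocess only on those. Assignment aligns on the index, so the
# non-German rows are left as missing values.
is_german = test["lang"] == "de"
test["cleanMasked"] = test.loc[is_german, "text"].apply(preprocess)
```

The masked version avoids calling the lambda on rows that are skipped anyway, which may matter on a large dataset, though I have not timed it on real data.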
