I'm working on an NLP exercise on Kaggle. While cleaning the text that I use to predict the output, I can't get the result to stay separated into words; instead I get a single string with all the words run together.

This is my text_cleaner function:

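import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_cleaner(text):
    text = str(text).lower() #lowercase
    text = re.sub(r'\d+', '', text) #remove numbers
    text = re.sub(r'\[.*?\]','', text) #remove text in square brackets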
    text = re.sub(r'https?://\S+|www\.\S+','',text) #remove url
    text = re.sub(r'\bhtml\b', '', text) #remove html word
    
    text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation
    text = re.sub('[^a-z]','',text) #removes non-alphabeticals
    text = text.replace('#', '')
    text = text.replace('@', '')
    text = stop_words(text)
    
    return text

def stop_words(text):
    lem = WordNetLemmatizer()
    stop = set(stopwords.words('english'))
    stop.remove('not')
    punctuation = list(string.punctuation)
    stop.update(punctuation)
    
    text =text.split()
    text= [lem.lemmatize(word) for word in text if word not in stop]
    text = ' '.join(text)
    
    return text

And this is the result that I got:

ourdeedsarethereasonofthisearthquakemayallahfo...

instead of:

deed reason earthquake may allah forgive u...

Thanks!

ewz93

1 Answer

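The line text = re.sub('[^a-z]','',text) removes every character that is not a lowercase letter from a to z, and that includes whitespace. Once the spaces are gone, text.split() in stop_words has nothing to split on, so the whole sentence is treated as one long word.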

If you replace it with text = re.sub('[^a-z ]','',text), i.e. "remove everything except a to z and spaces", it should work.
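To make the difference concrete, here is a small sketch that runs both patterns on a made-up sample sentence (the sample string and the variable names are only for illustration, not taken from your data):

```python
import re

# Made-up sample, already lowercased as in text_cleaner.
sample = "our deeds are the reason of this #earthquake"

# Original pattern: spaces are not in a-z, so they are removed as well.
fused = re.sub('[^a-z]', '', sample)

# Pattern with a space inside the character class: spaces survive.
spaced = re.sub('[^a-z ]', '', sample)

print(fused)           # one long run of letters
print(spaced)          # words still separated by spaces
print(spaced.split())  # split() now has something to split on
```

Note that only the plain space character is kept here; if your text can contain tabs or newlines, '[^a-z\s]' would keep those too.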

Also all of this:

text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation

and this:

text = text.replace('#', '')
text = text.replace('@', '')

are redundant: all these lines do is remove certain individual characters, and every one of those characters is removed anyway by re.sub('[^a-z ]','',text).
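Putting this together, a trimmed-down cleaner could look roughly like the sketch below. The name text_cleaner_simplified and the final whitespace-collapsing step are my own additions, and the stop_words call is left out here only so that the snippet runs without the NLTK data; in your code you would still call stop_words(text) before returning.

```python
import re

def text_cleaner_simplified(text):
    """Hypothetical shortened version of text_cleaner (stop-word step omitted)."""
    text = str(text).lower()                            # lowercase
    text = re.sub(r'\[.*?\]', '', text)                 # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove URLs before punctuation is stripped
    text = re.sub(r'\bhtml\b', '', text)                # remove the word "html"
    # Keep only a-z and spaces: this also drops digits, emojis,
    # punctuation, '#' and '@', so no separate steps are needed for them.
    text = re.sub(r'[^a-z ]', '', text)
    # Collapse the runs of spaces left behind by the removals (my addition).
    return ' '.join(text.split())

print(text_cleaner_simplified("Our Deeds are the Reason of this #earthquake! 123 http://t.co/abc"))
```

The URL step has to stay in front of the '[^a-z ]' step, because otherwise the letters of the URL would survive as a word.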

ewz93